DataFrame

class getml.data.DataFrame(name, roles=None)

Bases: object

Handler for the data stored in the getML engine.

The DataFrame class represents a data frame object in the getML engine but does not contain any actual data itself. To create such a data frame object, fill it with data via the Python API, and retrieve a handler for it, you can use one of the from_csv(), from_db(), from_json(), or from_pandas() class methods. The Importing data section in the user guide explains the particularities of each of these flavors of the unified import interface in detail.

In case the data frame object is already present in the engine - either in memory as a temporary object or on disk if save() was called earlier - the load_data_frame() function will create a new handler without altering the underlying data. For more information about the lifecycle of the data in the getML engine and its synchronization with the Python API, please see the corresponding user guide.

Parameters
  • name (str) – Unique identifier used to link the handler with the underlying data frame object in the engine.

  • roles (dict[str, List[str]], optional) –

    A dictionary mapping the roles to the column names (see colnames()).

    The roles dictionary is expected to have the following format

    roles = {getml.data.role.numeric: ["colname1", "colname2"],
             getml.data.role.target: ["colname3"]}
    

Raises
  • TypeError – If any of the input arguments is of wrong type.

  • ValueError – If one of the provided keys in roles does not match a definition in roles.

Examples

Creating a new data frame object in the getML engine and importing data is done by one of the class methods from_csv(), from_db(), from_json(), or from_pandas().

import numpy
import pandas

import getml

random = numpy.random.RandomState(7263)

table = pandas.DataFrame()
table['column_01'] = random.randint(0, 10, 1000).astype(str)
table['join_key'] = numpy.arange(1000)
table['time_stamp'] = random.rand(1000)
table['target'] = random.rand(1000)

df_table = getml.data.DataFrame.from_pandas(table, name='table')

In addition to creating a new data frame object in the getML engine and filling it with all the content of table, the from_pandas() function also returns a DataFrame handler to the underlying data.

You don’t have to create the data frame objects anew for each session. You can use their save() method to write them to disk, the list_data_frames() function to list all available objects in the engine, and load_data_frame() to create a DataFrame handler for a data set already present in the getML engine (see Lifecycles and synchronization between engine and API for details).

df_table.save()

getml.data.list_data_frames()

df_table_reloaded = getml.data.load_data_frame('table')

Note

Although the Python API does not store the actual data itself, you can use the to_csv(), to_db(), to_json(), and to_pandas() methods to retrieve them.
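
For example, the data imported above can be pulled back into a pandas.DataFrame (a short sketch reusing the df_table handler from the example):

table_retrieved = df_table.to_pandas()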

Attributes Summary

categorical_names

List of the names of all categorical columns.

colnames

List of the names of all columns.

join_key_names

List of the names of all join keys.

n_categorical

Number of categorical columns.

n_join_keys

Number of join keys.

n_numerical

Number of numerical columns.

n_targets

Number of target columns.

n_time_stamps

Number of time stamp columns.

n_unused

Number of unused columns.

n_unused_floats

Number of unused float columns.

n_unused_strings

Number of unused string columns.

numerical_names

List of the names of all numerical columns.

roles

The roles of the columns included in this DataFrame.

shape

A tuple containing the number of rows and columns of the DataFrame.

target_names

List of the names of all target columns.

time_stamp_names

List of the names of all time stamps.

unused_float_names

List of the names of all unused float columns.

unused_names

List of the names of all unused columns.

unused_string_names

List of the names of all unused string columns.

Methods Summary

add(col, name[, role, unit, time_formats])

Adds a column to the current DataFrame.

copy(name)

Creates a deep copy of the data frame under a new name.

delete([mem_only])

Deletes the data frame from the getML engine.

drop(name)

Remove the column identified by name.

from_csv(fnames, name[, num_lines_sniffed, …])

Create a DataFrame from CSV files.

from_db(table_name[, name, roles, ignore, …])

Create a DataFrame from a table in a database.

from_dict(data, name[, roles, ignore, dry])

Create a new DataFrame from a dict

from_json(json_str, name[, roles, ignore, dry])

Create a new DataFrame from a JSON string.

from_pandas(pandas_df, name[, roles, …])

Create a DataFrame from a pandas.DataFrame.

from_s3(bucket, keys, region, name[, …])

Create a DataFrame from CSV files located in an S3 bucket.

group_by(key, name, aggregations)

Creates new DataFrame by grouping over a join key.

join(name, other, join_key[, …])

Create a new DataFrame by joining the current instance with another DataFrame.

load()

Loads saved data from disk.

n_bytes()

Size of the data stored in the underlying data frame in the getML engine.

n_cols()

Number of columns in the current instance.

n_rows()

Number of rows in the current instance.

num_column(value)

Generates a float or integer column that consists solely of a single entry.

random([seed])

Create random column.

read_csv(fnames[, append, quotechar, sep, …])

Read CSV files.

read_db(table_name[, append, conn])

Fill from Database.

read_json(json_str[, append, time_formats])

Fill from JSON

read_pandas(pandas_df[, append])

Uploads a pandas.DataFrame.

read_query(query[, append, conn])

Fill from query

read_s3(bucket, keys, region[, append, sep, …])

Read CSV files from an S3 bucket.

refresh()

Aligns meta-information of the current instance with the corresponding data frame in the getML engine.

rowid()

Get the row numbers of the table.

save()

Writes the underlying data in the getML engine to disk.

set_role(names, role[, time_formats])

Assigns a new role to one or more columns.

set_unit(names, unit[, comparison_only])

Assigns a new unit to one or more columns.

string_column(value)

Generates a string column that consists solely of a single entry.

to_csv(fname[, quotechar, sep, batch_size])

Writes the underlying data into a newly created CSV file.

to_db(table_name[, conn])

Writes the underlying data into a newly created table in the database.

to_html([max_rows])

Represents the data frame in HTML format, optimized for an iPython notebook.

to_json()

Creates a JSON string from the current instance.

to_pandas()

Creates a pandas.DataFrame from the current instance.

to_placeholder()

Generates a Placeholder from the current DataFrame.

to_s3(bucket, key, region[, sep, batch_size])

Writes the underlying data into a newly created CSV file located in an S3 bucket.

where(name, condition)

Extract a subset of rows.

Attributes Documentation

categorical_names

List of the names of all categorical columns.

Returns

List of the names of all categorical columns.

Return type

List[str]

colnames

List of the names of all columns.

Returns

List of the names of all columns.

Return type

List[str]

join_key_names

List of the names of all join keys.

Returns

List of the names of all columns used as join keys.

Return type

List[str]

n_categorical

Number of categorical columns.

Returns

Number of categorical columns

Return type

int

n_join_keys

Number of join keys.

Returns

Number of columns used as join keys

Return type

int

n_numerical

Number of numerical columns.

Returns

Number of numerical columns

Return type

int

n_targets

Number of target columns.

Returns

Number of columns used as targets

Return type

int

n_time_stamps

Number of time stamp columns.

Returns

Number of columns used as time stamps

Return type

int

n_unused

Number of unused columns. Unused columns will not be used by the feature learning algorithms.

Returns

Number of columns that are unused.

Return type

int

n_unused_floats

Number of unused float columns. Unused columns will not be used by the feature learning algorithms.

Returns

Number of columns that are unused.

Return type

int

n_unused_strings

Number of unused string columns. Unused columns will not be used by the feature learning algorithms.

Returns

Number of columns that are unused.

Return type

int

numerical_names

List of the names of all numerical columns.

Returns

List of the names of all numerical columns.

Return type

List[str]

roles

The roles of the columns included in this DataFrame.

shape

A tuple containing the number of rows and columns of the DataFrame.

target_names

List of the names of all target columns.

Returns

List of the names of all columns used as target.

Return type

List[str]

time_stamp_names

List of the names of all time stamps.

Returns

List of the names of all columns used as time stamp.

Return type

List[str]

unused_float_names

List of the names of all unused float columns. Unused columns will not be used by the feature learning algorithms.

Returns

List of the names of all columns that are unused.

Return type

List[str]

unused_names

List of the names of all unused columns. Unused columns will not be used by the feature learning algorithms.

Returns

List of the names of all columns that are unused.

Return type

List[str]

unused_string_names

List of the names of all unused string columns. Unused columns will not be used by the feature learning algorithms.

Returns

List of the names of all columns that are unused.

Return type

List[str]

Methods Documentation

add(col, name, role=None, unit='', time_formats=None)

Adds a column to the current DataFrame.

Parameters
  • col (column or numpy.ndarray) – The column or numpy.ndarray to be added.

  • name (str) – Name of the new column.

  • role (str, optional) –

Role of the new column. Must be one of the roles defined in getml.data.roles.

  • unit (str, optional) – Unit of the column.

  • time_formats (str, optional) –

    Formats to be used to parse the time stamps.

This is only necessary if an implicit conversion from a StringColumn to a time stamp takes place.

    The formats are allowed to contain the following special characters:

    • %w - abbreviated weekday (Mon, Tue, …)

    • %W - full weekday (Monday, Tuesday, …)

    • %b - abbreviated month (Jan, Feb, …)

    • %B - full month (January, February, …)

    • %d - zero-padded day of month (01 .. 31)

    • %e - day of month (1 .. 31)

    • %f - space-padded day of month ( 1 .. 31)

    • %m - zero-padded month (01 .. 12)

    • %n - month (1 .. 12)

    • %o - space-padded month ( 1 .. 12)

    • %y - year without century (70)

    • %Y - year with century (1970)

    • %H - hour (00 .. 23)

    • %h - hour (00 .. 12)

    • %a - am/pm

    • %A - AM/PM

    • %M - minute (00 .. 59)

    • %S - second (00 .. 59)

    • %s - seconds and microseconds (equivalent to %S.%F)

    • %i - millisecond (000 .. 999)

    • %c - centisecond (0 .. 9)

    • %F - fractional seconds/microseconds (000000 - 999999)

    • %z - time zone differential in ISO 8601 format (Z or +NN.NN)

    • %Z - time zone differential in RFC format (GMT or +NNNN)

    • %% - percent sign

Raises
  • TypeError – If any of the input arguments is of wrong type.

  • ValueError – If one of the provided keys in roles does not match a definition in roles.
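
Example

A minimal sketch, assuming df is an existing DataFrame handler with 1000 rows; the column name and values are made up for illustration:

import numpy

# hypothetical values, one per row of df
discount = numpy.random.rand(1000)

df.add(discount, name="discount", role="numerical")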

copy(name)

Creates a deep copy of the data frame under a new name.

Parameters

name (str) – The name of the new data frame.

Returns

A handle to the deep copy.

Return type

DataFrame

delete(mem_only=False)

Deletes the data frame from the getML engine.

If called with the mem_only option set to True, the data frame corresponding to the handler represented by the current instance can be reloaded using the load() method.

Parameters

mem_only (bool, optional) – If True, the data frame will not be deleted permanently but just from memory (RAM).

Raises

TypeError – If any of the input arguments is of wrong type.
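
Example

A sketch of freeing memory in the engine without losing the data on disk (assuming df is an existing DataFrame handler):

df.save()                  # persist the data to disk
df.delete(mem_only=True)   # remove it from memory (RAM) only
df.load()                  # reload it into memory when needed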

drop(name)

Remove the column identified by name.

Parameters

name (str) – Name of the column to be removed. Must match exactly one column in the current instance.

Returns

Updated version of the current instance.

Return type

DataFrame
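
Example

A minimal sketch, assuming df contains a column named 'column_01':

df = df.drop('column_01')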

classmethod from_csv(fnames, name, num_lines_sniffed=1000, num_lines_read=0, quotechar='"', sep=',', skip=0, colnames=None, roles=None, ignore=False, dry=False)

Create a DataFrame from CSV files.

The fastest way to import data into the getML engine is to read it directly from CSV files. This class method will construct a data frame object in the engine, fill it with the data read from the CSV file(s), and return a corresponding DataFrame handle.

Parameters
  • fnames (List[str]) – CSV file paths to be read.

  • name (str) – Name of the data frame to be created.

  • num_lines_sniffed (int, optional) – Number of lines analyzed by the sniffer.

  • num_lines_read (int, optional) – Number of lines read from each file. Set to 0 to read in the entire file.

  • quotechar (str, optional) – The character used to wrap strings.

  • sep (str, optional) – The separator used for separating fields.

  • skip (int, optional) – Number of lines to skip at the beginning of each file.

  • colnames (List[str] or None, optional) – The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them.

  • roles (dict[str, List[str]], optional) –

    A dictionary mapping the roles to the column names. If this is not passed, then the roles will be sniffed from the CSV files. The roles dictionary should be in the following format:

    >>> roles = {"role1": ["colname1", "colname2"], "role2": ["colname3"]}
    

  • ignore (bool, optional) – Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)?

  • dry (bool, optional) – If set to True, then the data will not actually be read. Instead, the method will only return the roles it would have used. This can be used to hard-code roles when setting up a pipeline.

Raises
  • TypeError – If any of the input arguments is of a wrong type.

  • ValueError – If one of the provided keys in roles does not match a definition in roles.

Returns

Handler of the underlying data.

Return type

DataFrame

Note

The created data frame object is only held in memory by the getML engine. If you want to use it in later sessions or after switching the project, you have to call the save() method.

It is assumed that the first line of each CSV file contains a header with the column names.

In addition to reading data from a CSV file, you can also write an existing DataFrame back into one using to_csv() or replace/append to the current instance using the read_csv() method.

Examples

Let’s assume you have two CSV files - file1.csv and file2.csv - in the current working directory. You can import their data into the getML engine using:

>>> df_expd = data.DataFrame.from_csv(
...     fnames=["file1.csv", "file2.csv"],
...     name="MY DATA FRAME",
...     sep=';',
...     quotechar='"'
... )

However, the CSV format lacks type safety. If you want to build a reliable pipeline, it is a good idea to hard-code the roles:

>>> roles = {"categorical": ["colname1", "colname2"], "target": ["colname3"]}
>>>
>>> df_expd = data.DataFrame.from_csv(
...         fnames=["file1.csv", "file2.csv"],
...         name="MY DATA FRAME",
...         sep=';',
...         quotechar='"',
...         roles=roles
... )

If you think that typing out all of the roles by hand is too cumbersome, you can use a dry run:

>>> roles = data.DataFrame.from_csv(
...         fnames=["file1.csv", "file2.csv"],
...         name="MY DATA FRAME",
...         sep=';',
...         quotechar='"',
...         dry=True
... )

This will return the roles dictionary it would have used. You can now hard-code this.

classmethod from_db(table_name, name=None, roles=None, ignore=False, dry=False, conn=None)

Create a DataFrame from a table in a database.

It will construct a data frame object in the engine, fill it with the data read from table table_name in the connected database (see database), and return a corresponding DataFrame handle.

Parameters
  • table_name (str) – Name of the table to be read.

  • name (str) – Name of the data frame to be created. If not passed, then the table_name will be used.

  • roles (dict[str, List[str]], optional) –

    A dictionary mapping the roles to the column names. If this is not passed, then the roles will be sniffed from the table. The roles dictionary should be in the following format:

    >>> roles = {"role1": ["colname1", "colname2"], "role2": ["colname3"]}
    

  • ignore (bool, optional) – Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)?

  • dry (bool, optional) – If set to True, then the data will not actually be read. Instead, the method will only return the roles it would have used. This can be used to hard-code roles when setting up a pipeline.

  • conn (Connection, optional) – The database connection to be used. If you don’t explicitly pass a connection, the engine will use the default connection.

Raises
  • TypeError – If any of the input arguments is of a wrong type.

  • ValueError – If one of the provided keys in roles does not match a definition in roles.

Returns

Handler of the underlying data.

Return type

DataFrame

Note

The created data frame object is only held in memory by the getML engine. If you want to use it in later sessions or after switching the project, you have to call the save() method.

In addition to reading data from a table, you can also write an existing DataFrame back into a new one in the same database using to_db() or replace/append to the current instance using the read_db() or read_query() method.

Example

getml.database.connect_mysql(
    host="relational.fit.cvut.cz",
    port=3306,
    dbname="financial",
    user="guest",
    password="relational"
)

loan = getml.data.DataFrame.from_db(table_name='loan', name='data_frame_loan')

classmethod from_dict(data, name, roles=None, ignore=False, dry=False)

Create a new DataFrame from a dict

Parameters
  • data (dict) –

    The dict containing the data. The data should be in the following format:

    data = {'col1': [1.0, 2.0, 1.0], 'col2': ['A', 'B', 'C']}
    

  • name (str) – Name of the data frame to be created.

  • roles (dict[str, List[str]], optional) –

A dictionary mapping the roles to the column names. If this is not passed, then the roles will be sniffed from the dict. The roles dictionary should be in the following format:

    roles = {"role1": ["colname1", "colname2"], "role2": ["colname3"]}
    

  • ignore (bool, optional) – Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)?

  • dry (bool, optional) – If set to True, then the data will not actually be read. Instead, the method will only return the roles it would have used. This can be used to hard-code roles when setting up a pipeline.

Raises
  • TypeError – If any of the input arguments is of a wrong type.

  • ValueError – If one of the provided keys in roles does not match a definition in roles.

Returns

Handler of the underlying data.

Return type

DataFrame
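
Example

A minimal sketch with hard-coded roles (the column names are made up for illustration):

data = {'col1': [1.0, 2.0, 1.0], 'col2': ['A', 'B', 'C']}

df = getml.data.DataFrame.from_dict(
    data,
    name='my_df',
    roles={'numerical': ['col1'], 'categorical': ['col2']}
)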

classmethod from_json(json_str, name, roles=None, ignore=False, dry=False)

Create a new DataFrame from a JSON string.

It will construct a data frame object in the engine, fill it with the data read from the JSON string, and return a corresponding DataFrame handle.

Parameters
  • json_str (str) –

    The JSON string containing the data. The json_str should be in the following format:

    json_str = "{'col1': [1.0, 2.0, 1.0], 'col2': ['A', 'B', 'C']}"
    

  • name (str) – Name of the data frame to be created.

  • roles (dict[str, List[str]], optional) –

    A dictionary mapping the roles to the column names. If this is not passed, then the roles will be sniffed from the string. The roles dictionary should be in the following format:

    roles = {"role1": ["colname1", "colname2"], "role2": ["colname3"]}
    

  • ignore (bool, optional) – Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)?

  • dry (bool, optional) – If set to True, then the data will not actually be read. Instead, the method will only return the roles it would have used. This can be used to hard-code roles when setting up a pipeline.

Raises
  • TypeError – If any of the input arguments is of a wrong type.

  • ValueError – If one of the provided keys in roles does not match a definition in roles.

Returns

Handler of the underlying data.

Return type

DataFrame

Note

The created data frame object is only held in memory by the getML engine. If you want to use it in later sessions or after switching the project, you have to call the save() method.

In addition to reading data from a JSON string, you can also write an existing DataFrame back into one using to_json() or replace/append to the current instance using the read_json() method.
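
Example

A minimal sketch that builds the JSON string with the standard library (the column names are made up for illustration):

import json

json_str = json.dumps({'col1': [1.0, 2.0, 1.0], 'col2': ['A', 'B', 'C']})

df = getml.data.DataFrame.from_json(json_str, name='my_df')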

classmethod from_pandas(pandas_df, name, roles=None, ignore=False, dry=False)

Create a DataFrame from a pandas.DataFrame.

It will construct a data frame object in the engine, fill it with the data read from the pandas.DataFrame, and return a corresponding DataFrame handle.

Parameters
  • pandas_df (pandas.DataFrame) – The table to be read.

  • name (str) – Name of the data frame to be created.

  • roles (dict[str, List[str]], optional) –

    A dictionary mapping the roles to the column names. If this is not passed, then the roles will be sniffed from the pandas.DataFrame. The roles dictionary should be in the following format:

    roles = {"role1": ["colname1", "colname2"], "role2": ["colname3"]}
    

  • ignore (bool, optional) – Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)?

  • dry (bool, optional) – If set to True, then the data will not actually be read. Instead, the method will only return the roles it would have used. This can be used to hard-code roles when setting up a pipeline.

Raises
  • TypeError – If any of the input arguments is of a wrong type.

  • ValueError – If one of the provided keys in roles does not match a definition in roles.

Returns

Handler of the underlying data.

Return type

DataFrame

Note

The created data frame object is only held in memory by the getML engine. If you want to use it in later sessions or after switching the project, you have to call the save() method.

In addition to reading data from a pandas.DataFrame, you can also write an existing DataFrame back into one using to_pandas() or replace/append to the current instance using the read_pandas() method.

classmethod from_s3(bucket, keys, region, name, num_lines_sniffed=1000, num_lines_read=0, sep=',', skip=0, colnames=None, roles=None, ignore=False, dry=False)

Create a DataFrame from CSV files located in an S3 bucket.

NOTE THAT S3 IS NOT SUPPORTED ON WINDOWS.

This classmethod will construct a data frame object in the engine, fill it with the data read from the CSV file(s), and return a corresponding DataFrame handle.

Parameters
  • bucket (str) – The bucket from which to read the files.

  • keys (List[str]) – The list of keys (files in the bucket) to be read.

  • region (str) – The region in which the bucket is located.

  • name (str) – Name of the data frame to be created.

  • num_lines_sniffed (int, optional) – Number of lines analyzed by the sniffer.

  • num_lines_read (int, optional) – Number of lines read from each file. Set to 0 to read in the entire file.

  • sep (str, optional) – The separator used for separating fields.

  • skip (int, optional) – Number of lines to skip at the beginning of each file.

  • colnames (List[str] or None, optional) – The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them.

  • roles (dict[str, List[str]], optional) –

    A dictionary mapping the roles to the column names. If this is not passed, then the roles will be sniffed from the CSV files. The roles dictionary should be in the following format:

    >>> roles = {"role1": ["colname1", "colname2"], "role2": ["colname3"]}
    

  • ignore (bool, optional) – Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)?

  • dry (bool, optional) – If set to True, then the data will not actually be read. Instead, the method will only return the roles it would have used. This can be used to hard-code roles when setting up a pipeline.

Raises
  • TypeError – If any of the input arguments is of a wrong type.

  • ValueError – If one of the provided keys in roles does not match a definition in roles.

Returns

Handler of the underlying data.

Return type

DataFrame

Note

The created data frame object is only held in memory by the getML engine. If you want to use it in later sessions or after switching the project, you have to call the save() method.

It is assumed that the first line of each CSV file contains a header with the column names.

Example

Let’s assume you have two CSV files - file1.csv and file2.csv - in the bucket. You can import their data into the getML engine using the following commands:

>>> getml.engine.set_s3_access_key_id("YOUR-ACCESS-KEY-ID")
>>>
>>> getml.engine.set_s3_secret_access_key("YOUR-SECRET-ACCESS-KEY")
>>>
>>> data_frame_expd = data.DataFrame.from_s3(
...         bucket="your-bucket-name",
...         keys=["file1.csv", "file2.csv"],
...         region="us-east-2",
...         name="MY DATA FRAME",
...         sep=';'
... )

You can also set the access credentials as environment variables before you launch the getML engine.

Also refer to the documentation on from_csv() for further information on overriding the CSV sniffer for greater type safety.

group_by(key, name, aggregations)

Creates new DataFrame by grouping over a join key.

This function splits the DataFrame into groups with the same value in the key column, applies an aggregation function to one or more columns in each group, and combines the results into a new DataFrame. The aggregation function is defined for each column individually, which allows applying different aggregations to each column. In pandas this is known as named aggregation.

Parameters
  • key (str) – Name of the key to group by. If the key is a join key, the group_by will be faster, because join keys already have an index, whereas all other columns need to have an index built for the group_by.

  • name (str) – Name of the new DataFrame.

  • aggregations (List[Aggregation]) – Methods to apply on the groupings.

Raises

TypeError – If any of the input arguments is of wrong type.

Returns

Handler of the newly generated data frame object.

Return type

DataFrame

Examples

Generate example data

data = dict(
    fruit=["banana", "apple", "cherry", "cherry", "melon", "pineapple"],
    price=[2.4, 3.0, 1.2, 1.4, 3.4, 3.4],
    join_key=["0", "1", "2", "2", "3", "3"]
)
df = getml.data.DataFrame.from_dict(
    data,
    name="fruits",
    roles={"categorical": ["fruit"], "join_key": ["join_key"], "numerical": ["price"]}
)

df
| join_key | fruit       | price     |
| join key | categorical | numerical |
--------------------------------------
| 0        | banana      | 2.4       |
| 1        | apple       | 3         |
| 2        | cherry      | 1.2       |
| 2        | cherry      | 1.4       |
| 3        | melon       | 3.4       |
| 3        | pineapple   | 3.4       |

Group the DataFrame using join_key. Aggregate the resulting groups by averaging and summing over the price column and counting the distinct entries in the fruit column:

df_grouped = df.group_by("join_key", "fruits_grouped",
    [df["price"].avg(alias="avg price"),
    df["price"].sum(alias="total price"),
    df["fruit"].count_distinct(alias="unique items")])

df_grouped
| join_key | avg price | total price | unique items |
| join key | unused    | unused      | unused       |
-----------------------------------------------------
| 3        | 3.4       | 6.8         | 2            |
| 2        | 1.3       | 2.6         | 1            |
| 0        | 2.4       | 2.4         | 1            |
| 1        | 3         | 3           | 1            |

join(name, other, join_key, other_join_key=None, cols=None, other_cols=None, how='inner', where=None)

Create a new DataFrame by joining the current instance with another DataFrame.

Parameters
  • name (str) – The name of the new DataFrame.

  • other (DataFrame) – The other DataFrame.

  • join_key (str) – Name of the column containing the join key in the current instance.

  • other_join_key (str, optional) – Name of the join key in the other DataFrame. If set to None, join_key will be used for both the current instance and other.

  • cols (List[Union[FloatColumn, StringColumn]], optional) – Columns in the current instance to be included in the resulting DataFrame. If set to None, all columns will be used.

  • other_cols (List[Union[FloatColumn, StringColumn]], optional) – Columns in other to be included in the resulting DataFrame. If set to None, all columns will be used.

  • how (str, optional) –

    Type of the join.

    Supported options:

    • ’left’

    • ’inner’

    • ’right’

  • where (VirtualBooleanColumn, optional) –

    Boolean column indicating which rows are to be included in the resulting DataFrame. If set to None, all rows will be used.

    It imposes a SQL-like WHERE condition on the join.

Raises

TypeError – If any of the input arguments is of wrong type.

Returns

Handler of the newly created data frame object.

Return type

DataFrame

Examples

Create DataFrame

data_df = dict(
    colors=["blue", "green", "yellow", "orange"],
    numbers=[2.4, 3.0, 1.2, 1.4],
    join_key=["0", "1", "2", "3"]
)

df = getml.data.DataFrame.from_dict(
    data_df, name="df_1",
    roles=dict(join_key=["join_key"], numerical=["numbers"], categorical=["colors"]))

df
| join_key | colors      | numbers   |
| join key | categorical | numerical |
--------------------------------------
| 0        | blue        | 2.4       |
| 1        | green       | 3         |
| 2        | yellow      | 1.2       |
| 3        | orange      | 1.4       |

Create the other DataFrame

data_other = dict(
    colors=["blue", "green", "yellow", "black", "orange", "white"],
    numbers=[2.4, 3.0, 1.2, 1.4, 3.4, 2.2],
    join_key=["0", "1", "2", "2", "3", "4"])

other = getml.data.DataFrame.from_dict(
    data_other, name="df_2",
    roles=dict(join_key=["join_key"], numerical=["numbers"], categorical=["colors"]))

other
| join_key | colors      | numbers   |
| join key | categorical | numerical |
--------------------------------------
| 0        | blue        | 2.4       |
| 1        | green       | 3         |
| 2        | yellow      | 1.2       |
| 2        | black       | 1.4       |
| 3        | orange      | 3.4       |
| 4        | white       | 2.2       |

Left join the two DataFrames on their join key, while keeping the columns ‘colors’ and ‘numbers’ from the first one and the column ‘colors’ as ‘other_color’ from the second one. As a subcondition, only rows where the ‘numbers’ columns are equal are selected.

joined_df = df.join(
    name="joined_df",
    other=other,
    how="left",
    join_key="join_key",
    cols=[df["colors"], df["numbers"]],
    other_cols=[other["colors"].alias("other_color")],
    where=(df["numbers"] == other["numbers"]))

joined_df
| colors      | other_color | numbers   |
| categorical | categorical | numerical |
-----------------------------------------
| blue        | blue        | 2.4       |
| green       | green       | 3         |
| yellow      | yellow      | 1.2       |

load()

Loads saved data from disk.

The data frame object holding the same name as the current DataFrame instance will be loaded from disk into the getML engine, and the current handler will be updated using refresh().

Examples

First, we have to create and import a data set.

d, _ = getml.datasets.make_numerical(population_name = 'test')
getml.data.list_data_frames()

In the output of list_data_frames() we can find our underlying data frame object ‘test’ listed under the ‘in_memory’ key (it was created and imported by make_numerical()). This means the getML engine holds it in memory (RAM) only, and we still have to save() it to disk in order to load() it again or to prevent any loss of information between sessions.

d.save()
getml.data.list_data_frames()
d2 = getml.data.DataFrame(name = 'test').load()

Returns

Updated handle to the underlying data frame in the getML engine.

Return type

DataFrame

Note

When invoking load(), all changes of the underlying data frame object that took place after the last call to the save() method will be lost. This method thus enables you to undo changes applied to the DataFrame.

d, _ = getml.datasets.make_numerical()
d.save()

# Accidental change we want to undo
d.drop('column_01')

d.load()

If save() hasn’t been called on the current instance yet, or it wasn’t stored to disk in a previous session, load() will throw an exception:

File or directory ‘../projects/X/data/Y/’ not found!

Alternatively, load_data_frame() offers an easier way of creating DataFrame handlers to data in the getML engine.

n_bytes()

Size of the data stored in the underlying data frame in the getML engine.

Raises

Exception – If the data frame corresponding to the current instance could not be found in the getML engine.

Returns

Size of the underlying object in bytes.

Return type

numpy.uint64

n_cols()

Number of columns in the current instance.

Returns

Overall number of columns

Return type

int

n_rows()

Number of rows in the current instance.

Raises

Exception – If the data frame corresponding to the current instance could not be found in the getML engine.

Returns

Overall number of rows

Return type

numpy.int32

num_column(value)

Generates a float or integer column that consists solely of a single entry.

Parameters

value (float) – The value to be used.

Raises

TypeError – If any of the input arguments is of wrong type.

Returns

FloatColumn consisting of the singular entry.

Return type

VirtualFloatColumn
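
Example

A minimal sketch, assuming df is an existing DataFrame handler; the column name is made up for illustration:

ones = df.num_column(1.0)
df.add(ones, name='ones', role='numerical')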

random(seed=5849)

Create random column.

The numbers will be uniformly distributed from 0.0 to 1.0. This can be used to randomly split a population table into a training and a test set.

Parameters

seed (int) – Seed used for the random number generator.

Returns

FloatColumn containing random numbers

Return type

VirtualFloatColumn

Example

population = getml.data.DataFrame('population')
population.add(numpy.zeros(100), 'column_01')
print(len(population))
100
idx = population.random(seed=42)
population_train = population.where("population_train", idx > 0.7)
population_test = population.where("population_test", idx <= 0.7)
print(len(population_train), len(population_test))
27 73

read_csv(fnames, append=False, quotechar='"', sep=',', num_lines_read=0, skip=0, colnames=None, time_formats=None)

Read CSV files.

It is assumed that the first line of each CSV file contains a header with the column names.

Parameters
  • fnames (List[str]) – CSV file paths to be read.

  • append (bool, optional) – If a data frame object holding the same name is already present in the getML engine, should the content of the CSV files in fnames be appended, or should it replace the existing data?

  • quotechar (str, optional) – The character used to wrap strings.

  • sep (str, optional) – The separator used for separating fields.

  • num_lines_read (int, optional) – Number of lines read from each file. Set to 0 to read in the entire file.

  • skip (int, optional) – Number of lines to skip at the beginning of each file.

  • colnames (List[str] or None, optional) – The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them.

  • time_formats (List[str], optional) –

    The list of formats tried when parsing time stamps.

    The formats are allowed to contain the following special characters:

    • %w - abbreviated weekday (Mon, Tue, …)

    • %W - full weekday (Monday, Tuesday, …)

    • %b - abbreviated month (Jan, Feb, …)

    • %B - full month (January, February, …)

    • %d - zero-padded day of month (01 .. 31)

    • %e - day of month (1 .. 31)

    • %f - space-padded day of month ( 1 .. 31)

    • %m - zero-padded month (01 .. 12)

    • %n - month (1 .. 12)

    • %o - space-padded month ( 1 .. 12)

    • %y - year without century (70)

    • %Y - year with century (1970)

    • %H - hour (00 .. 23)

    • %h - hour (00 .. 12)

    • %a - am/pm

    • %A - AM/PM

    • %M - minute (00 .. 59)

    • %S - second (00 .. 59)

    • %s - seconds and microseconds (equivalent to %S.%F)

    • %i - millisecond (000 .. 999)

    • %c - centisecond (0 .. 9)

    • %F - fractional seconds/microseconds (000000 - 999999)

    • %z - time zone differential in ISO 8601 format (Z or +NN.NN)

    • %Z - time zone differential in RFC format (GMT or +NNNN)

    • %% - percent sign

Raises

TypeError – If any of the input arguments is of wrong type.

Returns

Handler of the underlying data.

Return type

DataFrame
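
Example

A minimal sketch, assuming file1.csv and file2.csv exist in the current working directory and their columns match those of df:

df.read_csv(
    fnames=['file1.csv', 'file2.csv'],
    append=True,
    sep=';'
)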

read_db(table_name, append=False, conn=None)

Fill from Database.

The DataFrame will be filled from a table in the database.

Parameters
  • table_name (str) – Table from which we want to retrieve the data.

  • append (bool, optional) – If a data frame object holding the same name is already present in the getML engine, should the content of table_name be appended, or should it replace the existing data?

  • conn (Connection, optional) – The database connection to be used. If you don’t explicitly pass a connection, the engine will use the default connection.

Raises

TypeError – If any of the input arguments is of wrong type.

Returns

Handler of the underlying data.

Return type

DataFrame

read_json(json_str, append=False, time_formats=None)

Fill from JSON

Fills the data frame with data from a JSON string.

Parameters
  • json_str (str) – The JSON string containing the data.

  • append (bool, optional) – If a data frame object holding the same name is already present in the getML engine, should the content of json_str be appended, or should it replace the existing data?

  • time_formats (List[str], optional) –

    The list of formats tried when parsing time stamps.

    The formats are allowed to contain the following special characters:

    • %w - abbreviated weekday (Mon, Tue, …)

    • %W - full weekday (Monday, Tuesday, …)

    • %b - abbreviated month (Jan, Feb, …)

    • %B - full month (January, February, …)

    • %d - zero-padded day of month (01 .. 31)

    • %e - day of month (1 .. 31)

    • %f - space-padded day of month ( 1 .. 31)

    • %m - zero-padded month (01 .. 12)

    • %n - month (1 .. 12)

    • %o - space-padded month ( 1 .. 12)

    • %y - year without century (70)

    • %Y - year with century (1970)

    • %H - hour (00 .. 23)

    • %h - hour (00 .. 12)

    • %a - am/pm

    • %A - AM/PM

    • %M - minute (00 .. 59)

    • %S - second (00 .. 59)

    • %s - seconds and microseconds (equivalent to %S.%F)

    • %i - millisecond (000 .. 999)

    • %c - centisecond (0 .. 9)

    • %F - fractional seconds/microseconds (000000 - 999999)

    • %z - time zone differential in ISO 8601 format (Z or +NN.NN)

    • %Z - time zone differential in RFC format (GMT or +NNNN)

    • %% - percent sign

Raises

TypeError – If any of the input arguments is of wrong type.

Returns

Handler of the underlying data.

Return type

DataFrame

read_pandas(pandas_df, append=False)

Uploads a pandas.DataFrame.

Replaces the actual content of the underlying data frame in the getML engine with pandas_df.

Parameters
  • pandas_df (pandas.DataFrame) – Data the underlying data frame object in the getML engine should obtain.

  • append (bool, optional) – If a data frame object holding the same name is already present in the getML engine, should the content of pandas_df be appended, or should it replace the existing data?

Raises

TypeError – If any of the input arguments is of wrong type.

Returns

Current instance.

Return type

DataFrame

Note

For columns containing pandas.Timestamp, small inconsistencies on the order of microseconds can occur when sending the data to the getML engine. This is due to the way the underlying information is stored.

read_query(query, append=False, conn=None)

Fill from query

Fills the data frame with data from a table in the database.

Parameters
  • query (str) – The query used to retrieve the data.

  • append (bool, optional) – If a data frame object holding the same name is already present in the getML engine, should the content of query be appended, or should it replace the existing data?

  • conn (Connection, optional) – The database connection to be used. If you don’t explicitly pass a connection, the engine will use the default connection.

Raises

TypeError – If any of the input arguments is of wrong type.

Returns

Handler of the underlying data.

Return type

DataFrame
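
Example

A minimal sketch, assuming a database connection has been established (see database) and a table named 'loan' with an 'amount' column exists in it:

df.read_query("SELECT * FROM loan WHERE amount > 1000;", append=False)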

read_s3(bucket, keys, region, append=False, sep=',', num_lines_read=0, skip=0, colnames=None, time_formats=None)

Read CSV files from an S3 bucket.

NOTE THAT S3 IS NOT SUPPORTED ON WINDOWS.

It is assumed that the first line of each CSV file contains a header with the column names.

Parameters
  • bucket (str) – The bucket from which to read the files.

  • keys (List[str]) – The list of keys (files in the bucket) to be read.

  • region (str) – The region in which the bucket is located.

  • append (bool, optional) – If a data frame object holding the same name is already present in the getML engine, should the content of the CSV files in keys be appended, or should it replace the existing data?

  • sep (str, optional) – The separator used for separating fields.

  • num_lines_read (int, optional) – Number of lines read from each file. Set to 0 to read in the entire file.

  • skip (int, optional) – Number of lines to skip at the beginning of each file.

  • colnames (List[str] or None, optional) – The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them.

  • time_formats (List[str], optional) –

    The list of formats tried when parsing time stamps.

    The formats are allowed to contain the following special characters:

    • %w - abbreviated weekday (Mon, Tue, …)

    • %W - full weekday (Monday, Tuesday, …)

    • %b - abbreviated month (Jan, Feb, …)

    • %B - full month (January, February, …)

    • %d - zero-padded day of month (01 .. 31)

    • %e - day of month (1 .. 31)

    • %f - space-padded day of month ( 1 .. 31)

    • %m - zero-padded month (01 .. 12)

    • %n - month (1 .. 12)

    • %o - space-padded month ( 1 .. 12)

    • %y - year without century (70)

    • %Y - year with century (1970)

    • %H - hour (00 .. 23)

    • %h - hour (00 .. 12)

    • %a - am/pm

    • %A - AM/PM

    • %M - minute (00 .. 59)

    • %S - second (00 .. 59)

    • %s - seconds and microseconds (equivalent to %S.%F)

    • %i - millisecond (000 .. 999)

    • %c - centisecond (0 .. 9)

    • %F - fractional seconds/microseconds (000000 - 999999)

    • %z - time zone differential in ISO 8601 format (Z or +NN.NN)

    • %Z - time zone differential in RFC format (GMT or +NNNN)

    • %% - percent sign

Raises

TypeError – If any of the input arguments is of wrong type.

Returns

Handler of the underlying data.

Return type

DataFrame

refresh()

Aligns meta-information of the current instance with the corresponding data frame in the getML engine.

This method can be used to avoid encoding conflicts. Note that load() as well as several other methods automatically call refresh().

Raises

Exception – If the getML engine does not respond with a valid DataFrame (which is just a precaution and not the expected behavior).

Returns

Updated handle to the underlying data frame in the getML engine.

Return type

DataFrame

rowid()

Get the row numbers of the table.

Returns

Numerical column containing the row IDs, starting at 0.

Return type

VirtualFloatColumn

save()

Writes the underlying data in the getML engine to disk.

To be stored persistently, the corresponding data frame object in the getML engine has to have been created already (via send()).

Returns

The current instance.

Return type

DataFrame

set_role(names, role, time_formats=None)

Assigns a new role to one or more columns.

When switching from a role based on type float to a role based on type string or vice versa, an implicit type conversion will be conducted. The time_formats argument is used to interpret the time format strings. For more information on roles, please refer to the user guide.

Parameters
  • names (str or List[str]) – The name or names of the column.

  • role (str) – The role to be assigned.

  • time_formats (str or List[str], optional) – Formats to be used to parse the time stamps. This is only necessary, if an implicit conversion from a StringColumn to a time stamp is taking place.

Raises
  • TypeError – If any of the input arguments has a wrong type.

  • ValueError – If one of the provided names does not correspond to an existing column.

Example

data_df = dict(
    animal=["hawk", "parrot", "goose"],
    votes=[12341, 5127, 65311],
    date=["04/06/2019", "01/03/2019", "24/12/2018"])
df = getml.data.DataFrame.from_dict(data_df, "animal_elections")
df.set_role(['animal'], getml.data.roles.categorical)
df.set_role(['votes'], getml.data.roles.numerical)
df.set_role(['date'], getml.data.roles.time_stamp, time_formats=['%d/%m/%Y'])

df
| date                        | animal      | votes     |
| time stamp                  | categorical | numerical |
---------------------------------------------------------
| 2019-06-04T00:00:00.000000Z | hawk        | 12341     |
| 2019-03-01T00:00:00.000000Z | parrot      | 5127      |
| 2018-12-24T00:00:00.000000Z | goose       | 65311     |

set_unit(names, unit, comparison_only=False)

Assigns a new unit to one or more columns.

Parameters
  • names (str or List[str]) – The name or names of the column.

  • unit (str) – The unit to be assigned.

  • comparison_only (bool) –

    Whether you want the column to be used for comparison only. This means that the column can only be used in comparison to other columns of the same unit.

    An example might be a bank account number: The number in itself is hardly interesting, but it might be useful to know how often we have seen that same bank account number in another table.

    If True, this will append “, comparison only” to the unit. The feature learning algorithms and the feature selectors will interpret this accordingly.

Raises
  • TypeError – If any of the input arguments is of wrong type.

  • ValueError – If one of the provided names does not correspond to an existing column.
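
Example

A minimal sketch, assuming df contains a column named 'account_id' that should only be compared against other columns of the same unit:

df.set_unit(['account_id'], 'account', comparison_only=True)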

string_column(value)

Generates a string column that consists solely of a single entry.

Parameters

value (str) – The value to be used.

Returns

Column consisting of the singular entry.

Return type

VirtualStringColumn

Raises

TypeError – If any of the input arguments is of wrong type.
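
Example

A minimal sketch, assuming df is an existing DataFrame handler; the column name is made up for illustration:

status = df.string_column('unknown')
df.add(status, name='status', role='categorical')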

to_csv(fname, quotechar='"', sep=',', batch_size=0)

Writes the underlying data into a newly created CSV file.

Parameters
  • fname (str) – The name of the CSV file. The ending “.csv” and an optional batch number will be added automatically.

  • quotechar (str, optional) – The character used to wrap strings.

  • sep (str, optional) – The character used for separating fields.

  • batch_size (int, optional) – Maximum number of lines per file. Set to 0 to write the entire data frame into a single file.

Raises

TypeError – If any of the input arguments is of wrong type.
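
Example

A minimal sketch; 'my_data' is a made-up file name, and the '.csv' ending is appended automatically:

df.to_csv('my_data', sep=';')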

to_db(table_name, conn=None)

Writes the underlying data into a newly created table in the database.

Parameters
  • table_name (str) –

    Name of the table to be created.

    If a table of that name already exists, it will be replaced.

  • conn (Connection, optional) – The database connection to be used. If you don’t explicitly pass a connection, the engine will use the default connection.

Raises

TypeError – If any of the input arguments is of wrong type.
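
Example

A minimal sketch, assuming a default database connection has been established (see database); the table name is made up:

df.to_db(table_name='MY_TABLE')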

to_html(max_rows=5)

Represents the data frame in HTML format, optimized for an IPython notebook.

Parameters

max_rows (int) – The maximum number of rows to be displayed.

to_json()

Creates a JSON string from the current instance.

Loads the underlying data from the getML engine and constructs a JSON string.

Returns

JSON string containing the names of the columns of the current instance as keys and their corresponding data as values.

Return type

str

to_pandas()

Creates a pandas.DataFrame from the current instance.

Loads the underlying data from the getML engine and constructs a pandas.DataFrame.

Returns

Pandas equivalent of the current instance including its underlying data.

Return type

pandas.DataFrame

to_placeholder()

Generates a Placeholder from the current DataFrame.

Returns

A placeholder with the same name as this data frame.

Return type

Placeholder
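
Example

A minimal sketch of preparing a data model from two handlers (population_df and peripheral_df stand in for existing DataFrames; the Placeholder.join() call and the join_key name are assumptions based on the data model API described in the user guide):

population = population_df.to_placeholder()
peripheral = peripheral_df.to_placeholder()

population.join(peripheral, join_key='join_key')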

to_s3(bucket, key, region, sep=',', batch_size=50000)

Writes the underlying data into a newly created CSV file located in an S3 bucket.

NOTE THAT S3 IS NOT SUPPORTED ON WINDOWS.

Parameters
  • bucket (str) – The bucket from which to read the files.

  • key (str) – The key in the S3 bucket in which you want to write the output. The ending “.csv” and an optional batch number will be added automatically.

  • region (str) – The region in which the bucket is located.

  • sep (str, optional) – The character used for separating fields.

  • batch_size (int, optional) – Maximum number of lines per file. Set to 0 to write the entire data frame into a single file.

Raises

TypeError – If any of the input arguments is of wrong type.

Example

>>> getml.engine.set_s3_access_key_id("YOUR-ACCESS-KEY-ID")
>>>
>>> getml.engine.set_s3_secret_access_key("YOUR-SECRET-ACCESS-KEY")
>>>
>>> df_expd.to_s3(
...     bucket="your-bucket-name",
...     key="filename-on-s3",
...     region="us-east-2",
...     sep=';'
... )

where(name, condition)

Extract a subset of rows.

Creates a new DataFrame as a subselection of the current instance. Internally it creates a new data frame object in the getML engine containing only a subset of rows of the original one and returns a handler to this new object.

Parameters
  • name (str) – Name of the new, resulting DataFrame.

  • condition (VirtualBooleanColumn) – Boolean column indicating the rows you want to select.

Raises

TypeError – If any of the input arguments is of wrong type.

Returns

Handler of the newly created data frame containing just a subset of the rows of the current instance.

Return type

DataFrame

Example

Generate example data:

data = dict(
    fruit=["banana", "apple", "cherry", "cherry", "melon", "pineapple"],
    price=[2.4, 3.0, 1.2, 1.4, 3.4, 3.4],
    join_key=["0", "1", "2", "2", "3", "3"])

fruits = getml.data.DataFrame.from_dict(
    data, name="fruits",
    roles={"categorical": ["fruit"], "join_key": ["join_key"], "numerical": ["price"]})

fruits
| join_key | fruit       | price     |
| join key | categorical | numerical |
--------------------------------------
| 0        | banana      | 2.4       |
| 1        | apple       | 3         |
| 2        | cherry      | 1.2       |
| 2        | cherry      | 1.4       |
| 3        | melon       | 3.4       |
| 3        | pineapple   | 3.4       |

Apply where condition. This creates a new DataFrame called “cherries”:

cherries = fruits.where(
    name="cherries",
    condition=(fruits["fruit"] == "cherry")
)

cherries
| join_key | fruit       | price     |
| join key | categorical | numerical |
--------------------------------------
| 2        | cherry      | 1.2       |
| 2        | cherry      | 1.4       |