DataFrame

class getml.data.DataFrame(name, roles=None)

Handler for the data stored in the getML engine.

The DataFrame class represents a data frame object in the getML engine but does not contain any actual data itself. To create such a data frame object, fill it with data via the Python API, and retrieve a handler for it, use one of the from_csv(), from_db(), from_json(), or from_pandas() class methods. The Importing data section in the user guide explains the particularities of each of these flavors of the unified import interface in detail.

If the data frame object is already present in the engine, either in memory as a temporary object or on disk because save() was called earlier, the load_data_frame() function creates a new handler without altering the underlying data. For more information about the lifecycle of the data in the getML engine and its synchronization with the Python API, see the corresponding user guide.
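The split between handler and data described above can be pictured with a small toy model. This is not getML code: the `ENGINE` dictionary, the `Handle` class, and `load_handle()` are hypothetical stand-ins, used only to show that a handler carries a name and that creating a second handler leaves the stored data untouched.

```python
# Toy model, not getML code: a dict stands in for the engine's memory.
ENGINE = {}  # hypothetical stand-in mapping a name to its column data


class Handle:
    """Hypothetical handler that stores only a name, never the data."""

    def __init__(self, name):
        self.name = name

    def n_rows(self):
        # Every query is answered by looking the name up in the "engine".
        columns = ENGINE[self.name]
        return len(next(iter(columns.values()))) if columns else 0


def load_handle(name):
    """Create a fresh handler for data that already exists in the "engine"."""
    if name not in ENGINE:
        raise KeyError(f"no data frame named {name!r}")
    return Handle(name)


ENGINE["table"] = {"target": [0.1, 0.7, 0.3]}
first = load_handle("table")
second = load_handle("table")  # a second handler; the data is untouched

ENGINE["table"]["target"].append(0.9)  # the data changes on the "engine" side
print(first.n_rows(), second.n_rows())
```

Because neither handler holds a copy, both report the new row count after the change on the "engine" side.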

Args:

name (str):

Unique identifier used to link the handler with the underlying data frame object in the engine.

roles (dict[str, List[str]], optional):

A dictionary mapping the roles to the column names (see colnames).

The roles dictionary is expected to have the following format:

roles = {getml.data.role.numerical: ["colname1", "colname2"],
         getml.data.role.target: ["colname3"]}
Raises:

TypeError: If any of the input arguments is of wrong type.

ValueError:

If one of the provided keys in roles does not match a role defined in getml.data.role.
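The two error conditions above can be sketched as a standalone check. This is a minimal illustration, not the library's implementation: `check_roles()` and `KNOWN_ROLES` are hypothetical names, and the plain strings in `KNOWN_ROLES` are assumed stand-ins for the constants in getml.data.role.

```python
# Assumed stand-ins for the role constants; not read from getml.data.role.
KNOWN_ROLES = {
    "categorical", "join_key", "numerical", "target",
    "time_stamp", "unused_float", "unused_string",
}


def check_roles(roles):
    """Hypothetical validator mirroring the documented TypeError/ValueError."""
    if not isinstance(roles, dict):
        raise TypeError("'roles' must be a dict")
    for role, names in roles.items():
        # Wrong types are reported before unknown role names.
        if not isinstance(role, str):
            raise TypeError("keys of 'roles' must be str")
        if not isinstance(names, list) or not all(
            isinstance(name, str) for name in names
        ):
            raise TypeError("values of 'roles' must be lists of str")
        if role not in KNOWN_ROLES:
            raise ValueError(f"unknown role: {role!r}")
    return roles


check_roles({"numerical": ["colname1", "colname2"], "target": ["colname3"]})
```

A misspelled key such as `"numeric_typo"` would raise ValueError in this sketch, while passing a single string instead of a list of column names would raise TypeError.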

Examples:

Creating a new data frame object in the getML engine and importing data is done by one of the class methods from_csv(), from_db(), from_json(), or from_pandas().

import numpy
import pandas
import getml

random = numpy.random.RandomState(7263)

table = pandas.DataFrame()
# numpy.str was removed in NumPy 1.24; the built-in str is equivalent here.
table['column_01'] = random.randint(0, 10, 1000).astype(str)
table['join_key'] = numpy.arange(1000)
table['time_stamp'] = random.rand(1000)
table['target'] = random.rand(1000)

df_table = getml.data.DataFrame.from_pandas(table, name='table')

In addition to creating a new data frame object in the getML engine and filling it with all the content of table, the from_pandas() function also returns a DataFrame handler to the underlying data.

You don’t have to create the data frame objects anew for each session. You can use their save() method to write them to disk, the list_data_frames() function to list all available objects in the engine, and load_data_frame() to create a DataFrame handler for a data set already present in the getML engine (see Lifecycles and synchronization between engine and API for details).

df_table.save()

getml.data.list_data_frames()

df_table_reloaded = getml.data.load_data_frame('table')

Note:

Although the Python API does not store the actual data itself, you can use the to_csv(), to_db(), to_json(), and to_pandas() methods to retrieve it.

Methods

add(col, name[, role, unit, time_formats])

Adds a column to the current DataFrame.

copy(name)

Creates a deep copy of the data frame under a new name.

delete([mem_only])

Deletes the data frame from the getML engine.

drop(name)

Remove the column identified by name.

from_csv(fnames, name[, num_lines_sniffed, …])

Create a DataFrame from CSV files.

from_db(table_name[, name, roles, ignore, …])

Create a DataFrame from a table in a database.

from_dict(data, name[, roles, ignore, dry])

Create a new DataFrame from a dict.

from_json(json_str, name[, roles, ignore, dry])

Create a new DataFrame from a JSON string.

from_pandas(pandas_df, name[, roles, …])

Create a DataFrame from a pandas.DataFrame.

from_s3(bucket, keys, region, name[, …])

Create a DataFrame from CSV files located in an S3 bucket.

group_by(key, name, aggregations)

Creates a new DataFrame by grouping over a join key.

join(name, other, join_key[, …])

Create a new DataFrame by joining the current instance with another DataFrame.

load()

Loads saved data from disk.

n_bytes()

Size of the data stored in the underlying data frame in the getML engine.

n_cols()

Number of columns in the current instance.

n_rows()

Number of rows in the current instance.

num_column(value)

Generates a float or integer column that consists solely of a single entry.

random([seed])

Create a random column.

read_csv(fnames[, append, quotechar, sep, …])

Read CSV files.

read_db(table_name[, append, conn])

Fill from a database.

read_json(json_str[, append, time_formats])

Fill from JSON.

read_pandas(pandas_df[, append])

Uploads a pandas.DataFrame.

read_query(query[, append, conn])

Fill from a query.

read_s3(bucket, keys, region[, append, sep, …])

Read CSV files from an S3 bucket.

refresh()

Aligns meta-information of the current instance with the corresponding data frame in the getML engine.

rowid()

Get the row numbers of the table.

save()

Writes the underlying data in the getML engine to disk.

set_role(names, role[, time_formats])

Assigns a new role to one or more columns.

set_unit(names, unit[, comparison_only])

Assigns a new unit to one or more columns.

string_column(value)

Generates a string column that consists solely of a single entry.

to_csv(fname[, quotechar, sep, batch_size])

Writes the underlying data into a newly created CSV file.

to_db(table_name[, conn])

Writes the underlying data into a newly created table in the database.

to_html([max_rows])

Represents the data frame in HTML format, optimized for an IPython notebook.

to_json()

Creates a JSON string from the current instance.

to_pandas()

Creates a pandas.DataFrame from the current instance.

to_placeholder()

Generates a Placeholder from the current DataFrame.

to_s3(bucket, key, region[, sep, batch_size])

Writes the underlying data into a newly created CSV file located in an S3 bucket.

where(name, condition)

Extract a subset of rows.

Attributes

categorical_names

List of the names of all categorical columns.

colnames

List of the names of all columns.

join_key_names

List of the names of all join keys.

n_categorical

Number of categorical columns.

n_join_keys

Number of join keys.

n_numerical

Number of numerical columns.

n_targets

Number of target columns.

n_time_stamps

Number of time stamp columns.

n_unused

Number of unused columns.

n_unused_floats

Number of unused float columns.

n_unused_strings

Number of unused string columns.

numerical_names

List of the names of all numerical columns.

roles

The roles of the columns included in this DataFrame.

shape

A tuple containing the number of rows and columns of the DataFrame.

target_names

List of the names of all target columns.

time_stamp_names

List of the names of all time stamps.

unused_float_names

List of the names of all unused float columns.

unused_names

List of the names of all unused columns.

unused_string_names

List of the names of all unused string columns.
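One way to read the attributes above is as views over the roles dictionary plus a row count. The sketch below is plain Python, not getML code. The role assignment for the four columns of the example table is hypothetical, the plain-string keys are assumed stand-ins for the role constants, and both the column order in `colnames` and the composition of `unused_names` are assumptions made for illustration.

```python
# Hypothetical role assignment for the four columns of the example table.
roles = {
    "categorical": ["column_01"],
    "join_key": ["join_key"],
    "numerical": [],
    "target": ["target"],
    "time_stamp": ["time_stamp"],
    "unused_float": [],
    "unused_string": [],
}
n_rows = 1000  # number of rows in the example table


def names_for(role):
    """Return a copy of the column names registered under one role."""
    return list(roles.get(role, []))


# The *_names attributes are lists; the n_* attributes are their lengths.
target_names = names_for("target")
n_targets = len(target_names)

# Assumption: unused_names combines the unused float and string columns.
unused_names = names_for("unused_float") + names_for("unused_string")
n_unused = len(unused_names)

# Assumption: colnames lists the columns role by role.
colnames = [name for names in roles.values() for name in names]

# shape is documented as (number of rows, number of columns).
shape = (n_rows, len(colnames))
```

With this assignment every column has a role, so the unused lists are empty and `shape` counts four columns.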