DataFrame

class getml.data.DataFrame(name, roles=None)

Handler for the data stored in the getML engine.

The DataFrame class represents a data frame object in the getML engine but does not contain any actual data itself. To create such a data frame object, fill it with data via the Python API, and retrieve a handler for it, use one of the from_csv(), from_db(), from_json(), or from_pandas() class methods. The Importing data section in the user guide explains the particularities of each of these flavors of the unified import interface in detail.

If the data frame object is already present in the engine, either in memory as a temporary object or on disk because save() was called earlier, the load_data_frame() function creates a new handler without altering the underlying data. For more information about the lifecycle of the data in the getML engine and its synchronization with the Python API, see the corresponding user guide.

Args:

name (str): Unique identifier used to link the handler with the underlying data frame object in the engine.

roles (dict[str, List[str]], optional):

A dictionary mapping the roles to the column names (see colnames()).

The roles dictionary is expected to have the following format:
roles = {getml.data.role.numerical: ["colname1", "colname2"],
         getml.data.role.target: ["colname3"]}

Raises:

TypeError: If any of the input arguments is of the wrong type.

ValueError: If one of the provided keys in roles does not match a role defined in getml.data.role.
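
The two error conditions above can be illustrated with a minimal, standard-library sketch. It is not getML's actual validation code: the helper check_roles and the set KNOWN_ROLES are hypothetical, and the role names are plain strings inferred from the attribute names of this class (categorical_names, join_key_names, and so on). The real constants live in getml.data.role and may differ.

```python
# Hypothetical role names, inferred from the *_names attributes of DataFrame.
# The real constants are defined in getml.data.role and may differ.
KNOWN_ROLES = frozenset({
    "categorical", "join_key", "numerical", "target",
    "time_stamp", "unused_float", "unused_string",
})


def check_roles(roles):
    """Raise TypeError or ValueError if `roles` is malformed (illustrative only)."""
    if roles is None:
        return  # roles is optional
    if not isinstance(roles, dict):
        raise TypeError("roles must be a dict mapping roles to lists of column names")
    for role, colnames in roles.items():
        if not isinstance(role, str):
            raise TypeError(f"role keys must be str, got {type(role).__name__}")
        if role not in KNOWN_ROLES:
            raise ValueError(f"unknown role: {role!r}")
        if not isinstance(colnames, list) or not all(isinstance(c, str) for c in colnames):
            raise TypeError(f"columns for role {role!r} must be a list of str")


# A well-formed dictionary passes silently.
check_roles({"numerical": ["colname1", "colname2"], "target": ["colname3"]})
```

In this sketch a misspelled key such as "numeric" raises ValueError, while passing a list instead of a dict, or a bare string instead of a list of column names, raises TypeError.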

Examples:

To create a new data frame object in the getML engine and import data into it, use one of the class methods from_csv(), from_db(), from_json(), or from_pandas().
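import numpy
import pandas

import getml

# Seeded generator so the example data is reproducible.
random = numpy.random.RandomState(7263)

table = pandas.DataFrame()
# Categorical values are stored as strings.
table['column_01'] = random.randint(0, 10, 1000).astype(str)
table['join_key'] = numpy.arange(1000)
table['time_stamp'] = random.rand(1000)
table['target'] = random.rand(1000)

# Uploads the table to the getML engine and returns a handler for it.
df_table = getml.data.DataFrame.from_pandas(table, name='table')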
In addition to creating a new data frame object in the getML engine and filling it with the entire content of table, the from_pandas() function also returns a DataFrame handler to the underlying data.

You don’t have to create the data frame objects anew for each session. You can use their save() method to write them to disk, the list_data_frames() function to list all available objects in the engine, and load_data_frame() to create a DataFrame handler for a data set already present in the getML engine (see Lifecycles and synchronization between engine and API for details).
df_table.save()

getml.data.list_data_frames()

df_table_reloaded = getml.data.load_data_frame('table')
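
The separation between a handler and the data it refers to can be pictured with a toy model. The sketch below is purely conceptual and runs without getML: ToyEngine and ToyHandle are hypothetical classes, and the return format of the toy list_data_frames() is made up for illustration. It only mirrors the behavior described above, namely that the engine owns the data, save() persists it, and loading creates a new handler without touching the data.

```python
class ToyEngine:
    """Toy stand-in for the engine: it owns the data, handlers only know names."""

    def __init__(self):
        self.memory = {}  # temporary, in-memory data frame objects
        self.disk = {}    # objects persisted by save()

    def create(self, name, rows):
        self.memory[name] = list(rows)
        return ToyHandle(self, name)

    def save(self, name):
        self.disk[name] = list(self.memory[name])

    def list_data_frames(self):
        return {"in_memory": sorted(self.memory), "on_disk": sorted(self.disk)}

    def load_data_frame(self, name):
        # Bring the object back into memory if it only exists on disk.
        if name not in self.memory:
            self.memory[name] = list(self.disk[name])  # KeyError if unknown
        return ToyHandle(self, name)


class ToyHandle:
    """Holds a name and a reference to the engine, never the rows themselves."""

    def __init__(self, engine, name):
        self.engine = engine
        self.name = name

    def save(self):
        self.engine.save(self.name)

    def n_rows(self):
        return len(self.engine.memory[self.name])


engine = ToyEngine()
handle = engine.create("table", [(0, 0.1), (1, 0.2)])
handle.save()

engine.memory.clear()  # simulate the start of a fresh session
reloaded = engine.load_data_frame("table")
print(reloaded.n_rows())
```

The reloaded handler is a different Python object than the first one, yet both refer to the same named object in the toy engine.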

Note:

Although the Python API does not store the actual data itself, you can use the to_csv(), to_db(), to_json(), and to_pandas() methods to retrieve it.

Methods

`add`(col, name[, role, unit, time_formats])	Adds a column to the current `DataFrame`.
`copy`(name)	Creates a deep copy of the data frame under a new name.
`delete`([mem_only])	Deletes the data frame from the getML engine.
`drop`(name)	Remove the column identified by name.
`from_csv`(fnames, name[, num_lines_sniffed, …])	Create a DataFrame from CSV files.
`from_db`(table_name[, name, roles, ignore, …])	Create a DataFrame from a table in a database.
`from_dict`(data, name[, roles, ignore, dry])	Create a new DataFrame from a dict.
`from_json`(json_str, name[, roles, ignore, dry])	Create a new DataFrame from a JSON string.
`from_pandas`(pandas_df, name[, roles, …])	Create a DataFrame from a `pandas.DataFrame`.
`from_s3`(bucket, keys, region, name[, …])	Create a DataFrame from CSV files located in an S3 bucket.
`group_by`(key, name, aggregations)	Creates a new `DataFrame` by grouping over a join key.
`join`(name, other, join_key[, …])	Create a new `DataFrame` by joining the current instance with another `DataFrame`.
`load`()	Loads saved data from disk.
`n_bytes`()	Size of the data stored in the underlying data frame in the getML engine.
`n_cols`()	Number of columns in the current instance.
`n_rows`()	Number of rows in the current instance.
`num_column`(value)	Generates a float or integer column that consists solely of a single entry.
`random`([seed])	Create random column.
`read_csv`(fnames[, append, quotechar, sep, …])	Read CSV files.
`read_db`(table_name[, append, conn])	Fill from a database table.
`read_json`(json_str[, append, time_formats])	Fill from a JSON string.
`read_pandas`(pandas_df[, append])	Uploads a `pandas.DataFrame`.
`read_query`(query[, append, conn])	Fill from a database query.
`read_s3`(bucket, keys, region[, append, sep, …])	Read CSV files from an S3 bucket.
`refresh`()	Aligns meta-information of the current instance with the corresponding data frame in the getML engine.
`rowid`()	Get the row numbers of the table.
`save`()	Writes the underlying data in the getML engine to disk.
`set_role`(names, role[, time_formats])	Assigns a new role to one or more columns.
`set_unit`(names, unit[, comparison_only])	Assigns a new unit to one or more columns.
`string_column`(value)	Generates a string column that consists solely of a single entry.
`to_csv`(fname[, quotechar, sep, batch_size])	Writes the underlying data into a newly created CSV file.
`to_db`(table_name[, conn])	Writes the underlying data into a newly created table in the database.
`to_html`([max_rows])	Represents the data frame in HTML format, optimized for an IPython notebook.
`to_json`()	Creates a JSON string from the current instance.
`to_pandas`()	Creates a `pandas.DataFrame` from the current instance.
`to_placeholder`()	Generates a `Placeholder` from the current `DataFrame`.
`to_s3`(bucket, key, region[, sep, batch_size])	Writes the underlying data into a newly created CSV file located in an S3 bucket.
`where`(name, condition)	Extract a subset of rows.

Attributes

`categorical_names`	List of the names of all categorical columns.
`colnames`	List of the names of all columns.
`join_key_names`	List of the names of all join keys.
`n_categorical`	Number of categorical columns.
`n_join_keys`	Number of join keys.
`n_numerical`	Number of numerical columns.
`n_targets`	Number of target columns.
`n_time_stamps`	Number of time stamp columns.
`n_unused`	Number of unused columns.
`n_unused_floats`	Number of unused float columns.
`n_unused_strings`	Number of unused string columns.
`numerical_names`	List of the names of all numerical columns.
`roles`	The roles of the columns included in this DataFrame.
`shape`	A tuple containing the number of rows and columns of the DataFrame.
`target_names`	List of the names of all target columns.
`time_stamp_names`	List of the names of all time stamps.
`unused_float_names`	List of the names of all unused float columns.
`unused_names`	List of the names of all unused columns.
`unused_string_names`	List of the names of all unused string columns.
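
The per-role name attributes listed above relate to a roles dictionary roughly as in the following sketch. It is an illustration only: the roles attribute presumably provides this information directly, the helper roles_from_handler is hypothetical, the role labels are assumed to match the attribute prefixes, and a SimpleNamespace stands in for a real handler so that the code runs without a getML engine.

```python
from types import SimpleNamespace

# Attribute names are taken from the Attributes table; the role labels on the
# left are an assumption based on the attribute prefixes.
ROLE_ATTRIBUTES = {
    "categorical": "categorical_names",
    "join_key": "join_key_names",
    "numerical": "numerical_names",
    "target": "target_names",
    "time_stamp": "time_stamp_names",
    "unused_float": "unused_float_names",
    "unused_string": "unused_string_names",
}


def roles_from_handler(df):
    """Collect the per-role column name lists of `df` into one dictionary."""
    return {role: list(getattr(df, attr)) for role, attr in ROLE_ATTRIBUTES.items()}


# Stand-in object with made-up column assignments, not a real getML handler.
fake_df = SimpleNamespace(
    categorical_names=["column_01"],
    join_key_names=["join_key"],
    numerical_names=[],
    target_names=["target"],
    time_stamp_names=["time_stamp"],
    unused_float_names=[],
    unused_string_names=[],
)

roles = roles_from_handler(fake_df)
```

Dropping the empty lists from the result would give a dictionary in the same shape as the roles argument of the constructor.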