DataFrame

class getml.data.DataFrame(name, roles=None)

Handler for the data stored in the getML engine.

The DataFrame class represents a data frame object in the getML engine but does not contain any actual data itself. To create such a data frame object, fill it with data via the Python API, and retrieve a handler for it, use one of the class methods from_csv(), from_db(), from_json(), or from_pandas(). The Importing data section in the user guide explains the particularities of each of these flavors of the unified import interface.

If the data frame object is already present in the engine, either in memory as a temporary object or on disk because save() was called earlier, the load_data_frame() function creates a new handler without altering the underlying data. For more information about the lifecycle of the data in the getML engine and its synchronization with the Python API, please see the corresponding user guide.

Args:

name (str):

Unique identifier used to link the handler with the underlying data frame object in the engine.

roles (dict[str, List[str]] or Roles, optional):

Maps the roles to the column names (see colnames()).

The roles dictionary is expected to have the following format:

roles = {getml.data.role.numeric: ["colname1", "colname2"],
         getml.data.role.target: ["colname3"]}

Alternatively, you can use the Roles class.
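
For the dictionary form shown above, a minimal sketch of a sanity check may be useful before passing it on. The helper below inverts a roles dictionary into a column-to-role lookup and rejects a column that is listed under two roles. The name `invert_roles` is hypothetical and not part of getML, the check assumes that each column carries a single role, and plain strings stand in for the `getml.data.role` constants so that the snippet runs without getML installed.

```python
from typing import Dict, List


def invert_roles(roles: Dict[str, List[str]]) -> Dict[str, str]:
    """Map each column name to its role; raise if a column appears twice.

    Hypothetical helper for illustration, not part of the getML API.
    """
    column_to_role: Dict[str, str] = {}
    for role, colnames in roles.items():
        for colname in colnames:
            if colname in column_to_role:
                raise ValueError(
                    f"column {colname!r} is assigned to both "
                    f"{column_to_role[colname]!r} and {role!r}"
                )
            column_to_role[colname] = role
    return column_to_role


# Plain strings stand in for getml.data.role.numeric and
# getml.data.role.target (an assumption made for this sketch).
roles = {
    "numeric": ["colname1", "colname2"],
    "target": ["colname3"],
}

column_to_role = invert_roles(roles)
```

The resulting lookup makes it easy to confirm which role a given column will receive before the dictionary is handed to the DataFrame constructor.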

Examples:

Creating a new data frame object in the getML engine and importing data is done by one of the class methods from_csv(), from_db(), from_json(), or from_pandas().

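import numpy
import pandas
import getml

# Seeded generator so the example data is reproducible.
random = numpy.random.RandomState(7263)

table = pandas.DataFrame()
# numpy.str was removed from recent NumPy releases; the builtin str is equivalent.
table['column_01'] = random.randint(0, 10, 1000).astype(str)
table['join_key'] = numpy.arange(1000)
table['time_stamp'] = random.rand(1000)
table['target'] = random.rand(1000)

# Upload the table to the getML engine and get a handler back.
df_table = getml.DataFrame.from_pandas(table, name='table')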

In addition to creating a new data frame object in the getML engine and filling it with all the content of table, the from_pandas() function also returns a DataFrame handler to the underlying data.

You don’t have to create the data frame objects anew for each session. You can use their save() method to write them to disk, the list_data_frames() function to list all available objects in the engine, and load_data_frame() to create a DataFrame handler for a data set already present in the getML engine (see Lifecycles and synchronization between engine and API for details).

df_table.save()

getml.data.list_data_frames()

df_table_reloaded = getml.data.load_data_frame('table')
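
The reuse pattern above can be wrapped in a small helper. This is a sketch under stated assumptions: `get_or_create` is a hypothetical name, the listing and loading functions are passed in as callables so that the snippet runs without an engine, and the listing is assumed to be a dict with `in_memory` and `on_disk` lists of names, which should be checked against the list_data_frames() documentation.

```python
from typing import Any, Callable, Dict, List


def get_or_create(
    name: str,
    list_frames: Callable[[], Dict[str, List[str]]],
    load_frame: Callable[[str], Any],
    create_frame: Callable[[], Any],
) -> Any:
    """Return a handler for `name`, loading it if the engine already has it.

    Hypothetical helper. `list_frames` is assumed to return a dict with
    the keys "in_memory" and "on_disk", each holding a list of names.
    """
    listing = list_frames()
    known = set(listing.get("in_memory", [])) | set(listing.get("on_disk", []))
    if name in known:
        return load_frame(name)
    return create_frame()


# Stand-ins for the engine so the sketch is runnable on its own.
calls = []


def fake_list() -> Dict[str, List[str]]:
    return {"in_memory": [], "on_disk": ["table"]}


def fake_load(name: str) -> str:
    calls.append(("load", name))
    return f"handler:{name}"


def fake_create() -> str:
    calls.append(("create",))
    return "handler:new"


handler = get_or_create("table", fake_list, fake_load, fake_create)
```

With getML available, the three callables would presumably be getml.data.list_data_frames, getml.data.load_data_frame, and a function wrapping the from_pandas() call from the example above.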

Note:

Although the Python API does not store the actual data itself, you can use the to_csv(), to_db(), to_json(), and to_pandas() methods to retrieve it.

Methods

`add`(col, name[, role, subroles, unit, ...])	Adds a column to the current `DataFrame`.
`copy`(name)	Creates a deep copy of the data frame under a new name.
`delete`()	Permanently deletes the data frame.
`drop`(cols)	Returns a new `View` that has one or several columns removed.
`freeze`()	Freezes the data frame.
`from_arrow`(table, name[, roles, ignore, dry])	Create a DataFrame from an Arrow Table.
`from_csv`(fnames, name[, num_lines_sniffed, ...])	Create a DataFrame from CSV files.
`from_db`(table_name[, name, roles, ignore, ...])	Create a DataFrame from a table in a database.
`from_dict`(data, name[, roles, ignore, dry])	Create a new DataFrame from a dict.
`from_json`(json_str, name[, roles, ignore, dry])	Create a new DataFrame from a JSON string.
`from_pandas`(pandas_df, name[, roles, ...])	Create a DataFrame from a `pandas.DataFrame`.
`from_parquet`(fname, name[, roles, ignore, dry])	Create a DataFrame from parquet files.
`from_pyspark`(spark_df, name[, roles, ...])	Create a DataFrame from a `pyspark.sql.DataFrame`.
`from_s3`(bucket, keys, region, name[, ...])	Create a DataFrame from CSV files located in an S3 bucket.
`from_view`(view, name[, dry])	Create a DataFrame from a `View`.
`load`()	Loads saved data from disk.
`nbytes`()	Size of the data stored in the underlying data frame in the getML engine.
`ncols`()	Number of columns in the current instance.
`nrows`()	Number of rows in the current instance.
`read_arrow`(table[, append])	Uploads a `pyarrow.Table`.
`read_csv`(fnames[, append, quotechar, sep, ...])	Read CSV files.
`read_db`(table_name[, append, conn])	Fill from a database.
`read_json`(json_str[, append, time_formats])	Fill from JSON.
`read_pandas`(pandas_df[, append])	Uploads a `pandas.DataFrame`.
`read_parquet`(fname[, append, verbose])	Read a parquet file.
`read_pyspark`(spark_df[, append])	Uploads a `pyspark.sql.DataFrame`.
`read_query`(query[, append, conn])	Fill from a query.
`read_s3`(bucket, keys, region[, append, sep, ...])	Read CSV files from an S3 bucket.
`read_view`(view[, append])	Read the data from a `View`.
`refresh`()	Aligns meta-information of the current instance with the corresponding data frame in the getML engine.
`remove_subroles`(cols)	Removes all `subroles` from one or more columns.
`remove_unit`(cols)	Removes the unit from one or more columns.
`save`()	Writes the underlying data in the getML engine to disk.
`set_role`(cols, role[, time_formats])	Assigns a new role to one or more columns.
`set_subroles`(cols, subroles[, append])	Assigns one or several new `subroles` to one or more columns.
`set_unit`(cols, unit[, comparison_only])	Assigns a new unit to one or more columns.
`to_arrow`()	Creates a `pyarrow.Table` from the current instance.
`to_csv`(fname[, quotechar, sep, batch_size])	Writes the underlying data into a newly created CSV file.
`to_db`(table_name[, conn])	Writes the underlying data into a newly created table in the database.
`to_html`([max_rows])	Represents the data frame in HTML format, optimized for an IPython notebook.
`to_json`()	Creates a JSON string from the current instance.
`to_pandas`()	Creates a `pandas.DataFrame` from the current instance.
`to_parquet`(fname[, compression])	Writes the underlying data into a newly created parquet file.
`to_placeholder`([name])	Generates a `Placeholder` from the current `DataFrame`.
`to_pyspark`(spark[, name])	Creates a `pyspark.sql.DataFrame` from the current instance.
`to_s3`(bucket, key, region[, sep, batch_size])	Writes the underlying data into a newly created CSV file located in an S3 bucket.
`unload`()	Unloads the data frame from memory.
`where`(index)	Extract a subset of rows.
`with_column`(col, name[, role, subroles, ...])	Returns a new `View` that contains an additional column.
`with_name`(name)	Returns a new `View` with a new name.
`with_role`(cols, role[, time_formats])	Returns a new `View` with modified roles.
`with_subroles`(cols, subroles[, append])	Returns a new view with one or several new subroles on one or more columns.
`with_unit`(cols, unit[, comparison_only])	Returns a view that contains a new unit on one or more columns.

Attributes

`colnames`	List of the names of all columns.
`columns`	Alias for `colnames()`.
`last_change`	A string describing the last time this data frame has been changed.
`memory_usage`	Convenience wrapper that returns the memory usage in MB.
`roles`	The roles of the columns included in this DataFrame.
`rowid`	The rowids for this data frame.
`shape`	A tuple containing the number of rows and columns of the DataFrame.