DataFrame¶

class getml.data.DataFrame(name, roles=None)[source]¶

Handler for the data stored in the getML engine.
The DataFrame class represents a data frame object in the getML engine but does not contain any actual data itself. To create such a data frame object, fill it with data via the Python API, and retrieve a handler for it, you can use one of the from_csv(), from_db(), from_json(), or from_pandas() class methods. The Importing data section in the user guide explains the particularities of each of these flavors of the unified import interface in detail.

If the data frame object is already present in the engine - either in memory as a temporary object or on disk after save() was called earlier - the load_data_frame() function will create a new handler without altering the underlying data. For more information about the lifecycle of the data in the getML engine and its synchronization with the Python API, please see the corresponding user guide.

Args:
    name (str): Unique identifier used to link the handler with the underlying data frame object in the engine.

    roles (dict[str, List[str]], optional): A dictionary mapping the roles to the column names (see colnames()). The roles dictionary is expected to have the following format:

        roles = {getml.data.role.numeric: ["colname1", "colname2"],
                 getml.data.role.target: ["colname3"]}
Raises:

    TypeError: If any of the input arguments is of a wrong type.

    ValueError: If one of the provided keys in roles does not match a definition in roles.
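As an engine-independent sketch of the ValueError condition above: the roles mapping is an ordinary Python dict whose keys must come from the predefined role strings. The column names here are hypothetical, and the set of role names below reflects the constants assumed to live in getml.data.role (plain strings in the Python API):

```python
# Role names assumed to match the getml.data.role constants.
VALID_ROLES = {"categorical", "join_key", "numeric", "target",
               "time_stamp", "unused_float", "unused_string"}

# Hypothetical column names, in the format shown above.
roles = {
    "numeric": ["colname1", "colname2"],
    "target": ["colname3"],
}

# Mimic the check that rejects unknown role keys with a ValueError.
for key in roles:
    if key not in VALID_ROLES:
        raise ValueError(f"Unknown role: {key}")

print(sorted(roles))  # ['numeric', 'target']
```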
Examples:

Creating a new data frame object in the getML engine and importing data is done by one of the class methods from_csv(), from_db(), from_json(), or from_pandas().

    random = numpy.random.RandomState(7263)
    table = pandas.DataFrame()
    table['column_01'] = random.randint(0, 10, 1000).astype(str)
    table['join_key'] = numpy.arange(1000)
    table['time_stamp'] = random.rand(1000)
    table['target'] = random.rand(1000)
    df_table = getml.data.DataFrame.from_pandas(table, name='table')
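Everything in the example above up to the from_pandas() call relies only on numpy and pandas, so the table can be inspected before it ever reaches the engine. One caveat: the numpy.str alias used in some older versions of this example was removed in NumPy 1.24, so the built-in str is used as the dtype instead. A quick standalone check of the constructed table:

```python
import numpy as np
import pandas as pd

random = np.random.RandomState(7263)
table = pd.DataFrame()
# numpy.str was removed in NumPy 1.24; the built-in str works as the dtype.
table['column_01'] = random.randint(0, 10, 1000).astype(str)
table['join_key'] = np.arange(1000)
table['time_stamp'] = random.rand(1000)
table['target'] = random.rand(1000)

print(table.shape)  # (1000, 4)
```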
In addition to creating a new data frame object in the getML engine and filling it with all the content of table, the from_pandas() function also returns a DataFrame handler to the underlying data.

You don't have to create the data frame objects anew for each session. You can use their save() method to write them to disk, the list_data_frames() function to list all available objects in the engine, and load_data_frame() to create a DataFrame handler for a data set already present in the getML engine (see Lifecycles and synchronization between engine and API for details).

    df_table.save()
    getml.data.list_data_frames()
    df_table_reloaded = getml.data.load_data_frame('table')
Note:

Although the Python API does not store the actual data itself, you can use the to_csv(), to_db(), to_json(), and to_pandas() methods to retrieve it.

Methods
add(col, name[, role, unit, time_formats])
    Adds a column to the current DataFrame.
copy(name)
    Creates a deep copy of the data frame under a new name.
delete([mem_only])
    Deletes the data frame from the getML engine.
drop(name)
    Removes the column identified by name.
from_csv(fnames, name[, num_lines_sniffed, …])
    Create a DataFrame from CSV files.
from_db(table_name[, name, roles, ignore, …])
    Create a DataFrame from a table in a database.
from_dict(data, name[, roles, ignore, dry])
    Create a new DataFrame from a dict.
from_json(json_str, name[, roles, ignore, dry])
    Create a new DataFrame from a JSON string.
from_pandas(pandas_df, name[, roles, …])
    Create a DataFrame from a pandas.DataFrame.
from_s3(bucket, keys, region, name[, …])
    Create a DataFrame from CSV files located in an S3 bucket.
group_by(key, name, aggregations)
    Creates a new DataFrame by grouping over a join key.
join(name, other, join_key[, …])
    Create a new DataFrame by joining the current instance with another DataFrame.
load()
    Loads saved data from disk.
n_bytes()
    Size of the data stored in the underlying data frame in the getML engine.
n_cols()
    Number of columns in the current instance.
n_rows()
    Number of rows in the current instance.
num_column(value)
    Generates a float or integer column that consists solely of a single entry.
random([seed])
    Create a random column.
read_csv(fnames[, append, quotechar, sep, …])
    Read CSV files.
read_db(table_name[, append, conn])
    Fill from a database table.
read_json(json_str[, append, time_formats])
    Fill from JSON.
read_pandas(pandas_df[, append])
    Uploads a pandas.DataFrame.
read_query(query[, append, conn])
    Fill from a query.
read_s3(bucket, keys, region[, append, sep, …])
    Read CSV files from an S3 bucket.
refresh()
    Aligns the meta-information of the current instance with the corresponding data frame in the getML engine.
rowid()
    Get the row numbers of the table.
save()
    Writes the underlying data in the getML engine to disk.
set_role(names, role[, time_formats])
    Assigns a new role to one or more columns.
set_unit(names, unit[, comparison_only])
    Assigns a new unit to one or more columns.
string_column(value)
    Generates a string column that consists solely of a single entry.
to_csv(fname[, quotechar, sep, batch_size])
    Writes the underlying data into a newly created CSV file.
to_db(table_name[, conn])
    Writes the underlying data into a newly created table in the database.
to_html([max_rows])
    Represents the data frame in HTML format, optimized for an iPython notebook.
to_json()
    Creates a JSON string from the current instance.
to_pandas()
    Creates a pandas.DataFrame from the current instance.
to_placeholder()
    Generates a Placeholder from the current DataFrame.
to_s3(bucket, key, region[, sep, batch_size])
    Writes the underlying data into a newly created CSV file located in an S3 bucket.
where(name, condition)
    Extract a subset of rows.
Attributes
List of the names of all categorical columns.
List of the names of all columns.
List of the names of all join keys.
Number of categorical columns.
Number of join keys.
Number of numerical columns.
Number of target columns.
Number of time stamp columns.
Number of unused columns.
Number of unused float columns.
Number of unused string columns.
List of the names of all numerical columns.
The roles of the columns included in this DataFrame.
A tuple containing the number of rows and columns of the DataFrame.
List of the names of all target columns.
List of the names of all time stamps.
List of the names of all unused float columns.
List of the names of all unused columns.
List of the names of all unused string columns.