DataFrame
- class getml.DataFrame(name, roles=None)
Handler for the data stored in the getML engine.
The DataFrame class represents a data frame object in the getML engine but does not contain any actual data itself. To create such a data frame object, fill it with data via the Python API, and retrieve a handler for it, use one of the from_csv(), from_db(), from_json(), or from_pandas() class methods. The Importing data section in the user guide explains the particularities of each of those flavors of the unified import interface.

If the data frame object is already present in the engine, either in memory as a temporary object or on disk because save() was called earlier, the load_data_frame() function will create a new handler without altering the underlying data. For more information about the lifecycle of the data in the getML engine and its synchronization with the Python API, please see the corresponding user guide.
- Args:
- name (str):
Unique identifier used to link the handler with the underlying data frame object in the engine.
- roles (dict[str, List[str]] or Roles, optional):
Maps the roles to the column names (see colnames()). The roles dictionary is expected to have the following format:

    roles = {getml.data.role.numeric: ["colname1", "colname2"], getml.data.role.target: ["colname3"]}

Alternatively, you can use the Roles class.
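To make the shape of the roles dictionary concrete, the following minimal sketch inverts a `{role: [column names]}` mapping into `{column name: role}`. The helper `invert_roles` is hypothetical and not part of getML, the plain string keys are placeholders for the role constants named above, and the rule that a column may appear under only one role is an assumption of this sketch.

```python
from typing import Dict, List


def invert_roles(roles: Dict[str, List[str]]) -> Dict[str, str]:
    """Turn {role: [columns]} into {column: role}.

    Hypothetical helper, not part of getML. Raises ValueError if a
    column is listed under more than one role (an assumption made here).
    """
    column_to_role: Dict[str, str] = {}
    for role, colnames in roles.items():
        for colname in colnames:
            if colname in column_to_role:
                raise ValueError(
                    f"column {colname!r} is listed under both "
                    f"{column_to_role[colname]!r} and {role!r}"
                )
            column_to_role[colname] = role
    return column_to_role


# Placeholder string keys standing in for the role constants.
roles = {
    "numeric": ["colname1", "colname2"],
    "target": ["colname3"],
}

column_to_role = invert_roles(roles)
print(column_to_role)
# {'colname1': 'numeric', 'colname2': 'numeric', 'colname3': 'target'}
```

A check like this can catch a column that was accidentally listed twice before the dictionary is passed on as the roles argument.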
- Examples:
Creating a new data frame object in the getML engine and importing data is done by one of the class methods from_csv(), from_db(), from_json(), or from_pandas().

    import getml
    import numpy
    import pandas

    # Seeded generator so that the example data is reproducible.
    random = numpy.random.RandomState(7263)

    table = pandas.DataFrame()
    # The numpy.str alias has been removed from NumPy; use the builtin str.
    table['column_01'] = random.randint(0, 10, 1000).astype(str)
    table['join_key'] = numpy.arange(1000)
    table['time_stamp'] = random.rand(1000)
    table['target'] = random.rand(1000)

    df_table = getml.DataFrame.from_pandas(table, name='table')

In addition to creating a new data frame object in the getML engine and filling it with all the content of table, the from_pandas() function also returns a DataFrame handler to the underlying data.

You don’t have to create the data frame objects anew for each session. You can use their save() method to write them to disk, the list_data_frames() function to list all available objects in the engine, and load_data_frame() to create a DataFrame handler for a data set already present in the getML engine (see Lifecycles and synchronization between engine and API for details).

    df_table.save()
    getml.data.list_data_frames()
    df_table_reloaded = getml.data.load_data_frame('table')

- Note:
Although the Python API does not store the actual data itself, you can use the to_csv(), to_db(), to_json(), and to_pandas() methods to retrieve it.
Methods
- add(col, name[, role, subroles, unit, ...]): Adds a column to the current DataFrame.
- copy(name): Creates a deep copy of the data frame under a new name.
- delete(): Permanently deletes the data frame.
- drop(cols): Returns a new View that has one or several columns removed.
- freeze(): Freezes the data frame.
- from_arrow(table, name[, roles, ignore, dry]): Create a DataFrame from an Arrow Table.
- from_csv(fnames, name[, num_lines_sniffed, ...]): Create a DataFrame from CSV files.
- from_db(table_name[, name, roles, ignore, ...]): Create a DataFrame from a table in a database.
- from_dict(data, name[, roles, ignore, dry]): Create a new DataFrame from a dict.
- from_json(json_str, name[, roles, ignore, dry]): Create a new DataFrame from a JSON string.
- from_pandas(pandas_df, name[, roles, ...]): Create a DataFrame from a pandas.DataFrame.
- from_parquet(fname, name[, roles, ignore, dry]): Create a DataFrame from parquet files.
- from_pyspark(spark_df, name[, roles, ...]): Create a DataFrame from a pyspark.sql.DataFrame.
- from_s3(bucket, keys, region, name[, ...]): Create a DataFrame from CSV files located in an S3 bucket.
- from_view(view, name[, dry]): Create a DataFrame from a View.
- load(): Loads saved data from disk.
- nbytes(): Size of the data stored in the underlying data frame in the getML engine.
- ncols(): Number of columns in the current instance.
- nrows(): Number of rows in the current instance.
- read_arrow(table[, append]): Uploads a pyarrow.Table.
- read_csv(fnames[, append, quotechar, sep, ...]): Read CSV files.
- read_db(table_name[, append, conn]): Fill from a database table.
- read_json(json_str[, append, time_formats]): Fill from JSON.
- read_pandas(pandas_df[, append]): Uploads a pandas.DataFrame.
- read_parquet(fname[, append, verbose]): Read a parquet file.
- read_pyspark(spark_df[, append]): Uploads a pyspark.sql.DataFrame.
- read_query(query[, append, conn]): Fill from query.
- read_s3(bucket, keys, region[, append, sep, ...]): Read CSV files from an S3 bucket.
- read_view(view[, append]): Read the data from a View.
- refresh(): Aligns meta-information of the current instance with the corresponding data frame in the getML engine.
- remove_subroles(cols): Removes all subroles from one or more columns.
- remove_unit(cols): Removes the unit from one or more columns.
- save(): Writes the underlying data in the getML engine to disk.
- set_role(cols, role[, time_formats]): Assigns a new role to one or more columns.
- set_subroles(cols, subroles[, append]): Assigns one or several new subroles to one or more columns.
- set_unit(cols, unit[, comparison_only]): Assigns a new unit to one or more columns.
- to_arrow(): Creates a pyarrow.Table from the current instance.
- to_csv(fname[, quotechar, sep, batch_size]): Writes the underlying data into a newly created CSV file.
- to_db(table_name[, conn]): Writes the underlying data into a newly created table in the database.
- to_html([max_rows]): Represents the data frame in HTML format, optimized for an IPython notebook.
- to_json(): Creates a JSON string from the current instance.
- to_pandas(): Creates a pandas.DataFrame from the current instance.
- to_parquet(fname[, compression]): Writes the underlying data into a newly created parquet file.
- to_placeholder([name]): Generates a Placeholder from the current DataFrame.
- to_pyspark(spark[, name]): Creates a pyspark.sql.DataFrame from the current instance.
- to_s3(bucket, key, region[, sep, batch_size]): Writes the underlying data into a newly created CSV file located in an S3 bucket.
- unload(): Unloads the data frame from memory.
- where(index): Extract a subset of rows.
- with_column(col, name[, role, subroles, ...]): Returns a new View that contains an additional column.
- with_name(name): Returns a new View with a new name.
- with_role(cols, role[, time_formats]): Returns a new View with modified roles.
- with_subroles(cols, subroles[, append]): Returns a new View with one or several new subroles on one or more columns.
- with_unit(cols, unit[, comparison_only]): Returns a new View that contains a new unit on one or more columns.
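The method names above follow a pattern: the set_* methods assign something on the data frame itself, while the with_* methods return a new View. The sketch below illustrates that difference for set_role(cols, role) and with_role(cols, role). It does not talk to the getML engine: `RecordingFrame` is a hypothetical stand-in that only records calls, and `assign_roles` along with the role strings passed to it are assumptions made for illustration.

```python
class RecordingFrame:
    """Hypothetical stand-in for a data frame handler; not part of getML.

    It only records which methods were called so that the sketch runs
    without an engine.
    """

    def __init__(self):
        self.calls = []

    def set_role(self, cols, role, time_formats=None):
        # Mirrors set_role(cols, role[, time_formats]): no return value.
        self.calls.append(("set_role", tuple(cols), role))

    def with_role(self, cols, role, time_formats=None):
        # Mirrors with_role(cols, role[, time_formats]): returns a new object.
        self.calls.append(("with_role", tuple(cols), role))
        return ("view", tuple(cols), role)  # placeholder for a View


def assign_roles(df, key_cols, key_role, target_cols, target_role):
    """Assign one role on `df` itself and request another one via a view."""
    # set_* style: the change is applied to the data frame behind `df`.
    df.set_role(key_cols, key_role)
    # with_* style: the result is a new object that the caller must keep.
    return df.with_role(target_cols, target_role)


frame = RecordingFrame()
view = assign_roles(frame, ["join_key"], "join_key", ["target"], "target")
print(frame.calls)
print(view)
```

With a real DataFrame handler, the role arguments would be the role constants provided by getML rather than the placeholder strings used here.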
Attributes
List of the names of all columns.
Alias for colnames().
A string describing the last time this data frame has been changed.
Convenience wrapper that returns the memory usage in MB.
The roles of the columns included in this DataFrame.
The rowids for this data frame.
A tuple containing the number of rows and columns of the DataFrame.