DataFrame¶
-
class getml.data.DataFrame(name, roles=None)¶
Bases: object
Handler for the data stored in the getML engine.
The DataFrame class represents a data frame object in the getML engine but does not contain any actual data itself. To create such a data frame object, fill it with data via the Python API, and retrieve a handler for it, you can use one of the from_csv(), from_db(), from_json(), or from_pandas() class methods. The Importing data section in the user guide explains the particularities of each of those flavors of the unified import interface in detail.
In case the data frame object is already present in the engine - either in memory as a temporary object or on disk when save() was called earlier - the load_data_frame() function will create a new handler without altering the underlying data. For more information about the lifecycle of the data in the getML engine and its synchronization with the Python API, please see the corresponding user guide.
- Parameters
name (str) – Unique identifier used to link the handler with the underlying data frame object in the engine.
roles (dict[str, List[str]], optional) –
A dictionary mapping the roles to the column names (see colnames()). The roles dictionary is expected to have the following format:
roles = {getml.data.role.numeric: ["colname1", "colname2"], getml.data.role.target: ["colname3"]}
- Raises
TypeError – If any of the input arguments is of wrong type.
ValueError – If one of the provided keys in roles does not match a definition in roles.
Examples
Creating a new data frame object in the getML engine and importing data is done by one of the class methods from_csv(), from_db(), from_json(), or from_pandas().
random = numpy.random.RandomState(7263)

table = pandas.DataFrame()
table['column_01'] = random.randint(0, 10, 1000).astype(numpy.str)
table['join_key'] = numpy.arange(1000)
table['time_stamp'] = random.rand(1000)
table['target'] = random.rand(1000)

df_table = getml.data.DataFrame.from_pandas(table, name='table')
In addition to creating a new data frame object in the getML engine and filling it with all the content of table, the from_pandas() function also returns a DataFrame handler to the underlying data.
You don't have to create the data frame objects anew for each session. You can use their save() method to write them to disk, the list_data_frames() function to list all available objects in the engine, and load_data_frame() to create a DataFrame handler for a data set already present in the getML engine (see Lifecycles and synchronization between engine and API for details).
df_table.save()

getml.data.list_data_frames()

df_table_reloaded = getml.data.load_data_frame('table')
Note
Although the Python API does not store the actual data itself, you can use the to_csv(), to_db(), to_json(), and to_pandas() methods to retrieve it.
Attributes Summary
List of the names of all categorical columns.
List of the names of all columns.
List of the names of all join keys.
Number of categorical columns.
Number of join keys.
Number of numerical columns.
Number of target columns.
Number of time stamp columns.
Number of unused columns.
Number of unused float columns.
Number of unused string columns.
List of the names of all numerical columns.
The roles of the columns included in this DataFrame.
A tuple containing the number of rows and columns of the DataFrame.
List of the names of all target columns.
List of the names of all time stamps.
List of the names of all unused float columns.
List of the names of all unused columns.
List of the names of all unused string columns.
Methods Summary
add(col, name[, role, unit, time_formats]) – Adds a column to the current DataFrame.
copy(name) – Creates a deep copy of the data frame under a new name.
delete([mem_only]) – Deletes the data frame from the getML engine.
drop(name) – Removes the column identified by name.
from_csv(fnames, name[, num_lines_sniffed, …]) – Create a DataFrame from CSV files.
from_db(table_name[, name, roles, ignore, …]) – Create a DataFrame from a table in a database.
from_dict(data, name[, roles, ignore, dry]) – Create a new DataFrame from a dict.
from_json(json_str, name[, roles, ignore, dry]) – Create a new DataFrame from a JSON string.
from_pandas(pandas_df, name[, roles, …]) – Create a DataFrame from a pandas.DataFrame.
from_s3(bucket, keys, region, name[, …]) – Create a DataFrame from CSV files located in an S3 bucket.
group_by(key, name, aggregations) – Creates a new DataFrame by grouping over a join key.
join(name, other, join_key[, …]) – Create a new DataFrame by joining the current instance with another DataFrame.
load() – Loads saved data from disk.
n_bytes() – Size of the data stored in the underlying data frame in the getML engine.
n_cols() – Number of columns in the current instance.
n_rows() – Number of rows in the current instance.
num_column(value) – Generates a float or integer column that consists solely of a single entry.
random([seed]) – Creates a random column.
read_csv(fnames[, append, quotechar, sep, …]) – Read CSV files.
read_db(table_name[, append, conn]) – Fill from Database.
read_json(json_str[, append, time_formats]) – Fill from JSON.
read_pandas(pandas_df[, append]) – Uploads a pandas.DataFrame.
read_query(query[, append, conn]) – Fill from query.
read_s3(bucket, keys, region[, append, sep, …]) – Read CSV files from an S3 bucket.
refresh() – Aligns meta-information of the current instance with the corresponding data frame in the getML engine.
rowid() – Get the row numbers of the table.
save() – Writes the underlying data in the getML engine to disk.
set_role(names, role[, time_formats]) – Assigns a new role to one or more columns.
set_unit(names, unit[, comparison_only]) – Assigns a new unit to one or more columns.
string_column(value) – Generates a string column that consists solely of a single entry.
to_csv(fname[, quotechar, sep, batch_size]) – Writes the underlying data into a newly created CSV file.
to_db(table_name[, conn]) – Writes the underlying data into a newly created table in the database.
to_html([max_rows]) – Represents the data frame in HTML format, optimized for an iPython notebook.
to_json() – Creates a JSON string from the current instance.
to_pandas() – Creates a pandas.DataFrame from the current instance.
to_placeholder() – Generates a Placeholder from the current DataFrame.
to_s3(bucket, key, region[, sep, batch_size]) – Writes the underlying data into a newly created CSV file located in an S3 bucket.
where(name, condition) – Extract a subset of rows.
Attributes Documentation
-
categorical_names¶
List of the names of all categorical columns.
- Returns
List of the names of all categorical columns.
- Return type
List[str]
-
colnames¶
List of the names of all columns.
- Returns
List of the names of all columns.
- Return type
List[str]
-
join_key_names¶
List of the names of all join keys.
- Returns
List of the names of all columns used as join keys.
- Return type
List[str]
-
n_categorical¶
Number of categorical columns.
- Returns
Number of categorical columns.
- Return type
int
-
n_join_keys¶
Number of join keys.
- Returns
Number of columns used as join keys.
- Return type
int
-
n_numerical¶
Number of numerical columns.
- Returns
Number of numerical columns.
- Return type
int
-
n_targets¶
Number of target columns.
- Returns
Number of columns used as targets.
- Return type
int
-
n_time_stamps¶
Number of time stamp columns.
- Returns
Number of columns used as time stamps.
- Return type
int
-
n_unused¶
Number of unused columns. Unused columns will not be used by the feature learning algorithms.
- Returns
Number of columns that are unused.
- Return type
int
-
n_unused_floats¶
Number of unused float columns. Unused columns will not be used by the feature learning algorithms.
- Returns
Number of float columns that are unused.
- Return type
int
-
n_unused_strings¶
Number of unused string columns. Unused columns will not be used by the feature learning algorithms.
- Returns
Number of string columns that are unused.
- Return type
int
-
numerical_names¶
List of the names of all numerical columns.
- Returns
List of the names of all numerical columns.
- Return type
List[str]
-
roles¶
The roles of the columns included in this DataFrame.
-
shape¶
A tuple containing the number of rows and columns of the DataFrame.
-
target_names¶
List of the names of all target columns.
- Returns
List of the names of all columns used as targets.
- Return type
List[str]
-
time_stamp_names¶
List of the names of all time stamps.
- Returns
List of the names of all columns used as time stamps.
- Return type
List[str]
-
unused_float_names¶
List of the names of all unused float columns. Unused columns will not be used by the feature learning algorithms.
- Returns
List of the names of all float columns that are unused.
- Return type
List[str]
-
unused_names¶
List of the names of all unused columns. Unused columns will not be used by the feature learning algorithms.
- Returns
List of the names of all columns that are unused.
- Return type
List[str]
-
unused_string_names¶
List of the names of all unused string columns. Unused columns will not be used by the feature learning algorithms.
- Returns
List of the names of all string columns that are unused.
- Return type
List[str]
Methods Documentation
-
add(col, name, role=None, unit='', time_formats=None)¶
Adds a column to the current DataFrame.
- Parameters
col (column or numpy.ndarray) – The column or numpy.ndarray to be added.
name (str) – Name of the new column.
role (str, optional) – Role of the new column. Must be one of the roles defined in roles: categorical, join_key, numerical, target, time_stamp, unused_float, or unused_string.
unit (str, optional) – Unit of the column.
time_formats (str, optional) – Formats to be used to parse the time stamps. This is only necessary if an implicit conversion from a StringColumn to a time stamp is taking place.
The formats are allowed to contain the following special characters:
%w - abbreviated weekday (Mon, Tue, …)
%W - full weekday (Monday, Tuesday, …)
%b - abbreviated month (Jan, Feb, …)
%B - full month (January, February, …)
%d - zero-padded day of month (01 .. 31)
%e - day of month (1 .. 31)
%f - space-padded day of month ( 1 .. 31)
%m - zero-padded month (01 .. 12)
%n - month (1 .. 12)
%o - space-padded month ( 1 .. 12)
%y - year without century (70)
%Y - year with century (1970)
%H - hour (00 .. 23)
%h - hour (00 .. 12)
%a - am/pm
%A - AM/PM
%M - minute (00 .. 59)
%S - second (00 .. 59)
%s - seconds and microseconds (equivalent to %S.%F)
%i - millisecond (000 .. 999)
%c - centisecond (0 .. 9)
%F - fractional seconds/microseconds (000000 - 999999)
%z - time zone differential in ISO 8601 format (Z or +NN.NN)
%Z - time zone differential in RFC format (GMT or +NNNN)
%% - percent sign
- Raises
TypeError – If any of the input arguments is of wrong type.
ValueError – If one of the provided keys in roles does not match a definition in roles.
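Example
A minimal sketch of adding a numpy array under a given role; the handler df and the column name 'discount' are placeholder assumptions:
import numpy

# Add a random float column named 'discount' and assign the
# numerical role right away.
df.add(numpy.random.rand(df.n_rows()), name="discount", role="numerical")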
-
copy(name)¶
Creates a deep copy of the data frame under a new name.
- Parameters
name (str) – The name of the new data frame.
- Returns
A handle to the deep copy.
- Return type
DataFrame
-
delete(mem_only=False)¶
Deletes the data frame from the getML engine.
If called with the mem_only option set to True, the data frame corresponding to the handler represented by the current instance can be reloaded using the load() method.
- Parameters
mem_only (bool, optional) – If True, the data frame will not be deleted permanently but just from memory (RAM).
- Raises
TypeError – If any of the input arguments is of wrong type.
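Example
A minimal sketch, assuming df is an existing DataFrame handler:
# Delete the data frame from memory (RAM) only; if it was saved to
# disk earlier, it can be restored via load().
df.delete(mem_only=True)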
-
drop(name)¶
Removes the column identified by name.
- Parameters
name (str) – Name of the column to be removed. Must match exactly one column in the current instance.
- Returns
Updated version of the current instance.
- Return type
DataFrame
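Example
Since the updated handler is returned, the result can be reassigned directly; 'column_01' is a placeholder name:
df = df.drop("column_01")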
-
classmethod from_csv(fnames, name, num_lines_sniffed=1000, num_lines_read=0, quotechar='"', sep=',', skip=0, colnames=None, roles=None, ignore=False, dry=False)¶
Create a DataFrame from CSV files.
The fastest way to import data into the getML engine is to read it directly from CSV files. It will construct a data frame object in the engine, fill it with the data read from the CSV file(s), and return a corresponding DataFrame handle.
- Parameters
fnames (List[str]) – CSV file paths to be read.
name (str) – Name of the data frame to be created.
num_lines_sniffed (int, optional) – Number of lines analyzed by the sniffer.
num_lines_read (int, optional) – Number of lines read from each file. Set to 0 to read in the entire file.
quotechar (str, optional) – The character used to wrap strings.
sep (str, optional) – The separator used for separating fields.
skip (int, optional) – Number of lines to skip at the beginning of each file.
colnames (List[str] or None, optional) – The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them.
roles (dict[str, List[str]], optional) –
A dictionary mapping the roles to the column names. If this is not passed, then the roles will be sniffed from the CSV files. The roles dictionary should be in the following format:
>>> roles = {"role1": ["colname1", "colname2"], "role2": ["colname3"]}
ignore (bool, optional) – Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)?
dry (bool, optional) – If set to True, then the data will not actually be read. Instead, the method will only return the roles it would have used. This can be used to hard-code roles when setting up a pipeline.
- Raises
TypeError – If any of the input arguments is of a wrong type.
ValueError – If one of the provided keys in roles does not match a definition in roles.
- Returns
Handler of the underlying data.
- Return type
DataFrame
Note
The created data frame object is only held in memory by the getML engine. If you want to use it in later sessions or after switching the project, you have to call the save() method.
It is assumed that the first line of each CSV file contains a header with the column names.
In addition to reading data from a CSV file, you can also write an existing DataFrame back into one using to_csv(), or replace/append to the current instance using the read_csv() method.
Examples
Let's assume you have two CSV files - file1.csv and file2.csv - in the current working directory. You can import their data into the getML engine using:
>>> df_expd = data.DataFrame.from_csv(
...     fnames=["file1.csv", "file2.csv"],
...     name="MY DATA FRAME",
...     sep=';',
...     quotechar='"'
... )
However, the CSV format lacks type safety. If you want to build a reliable pipeline, it is a good idea to hard-code the roles:
>>> roles = {"categorical": ["colname1", "colname2"], "target": ["colname3"]}
>>>
>>> df_expd = data.DataFrame.from_csv(
...     fnames=["file1.csv", "file2.csv"],
...     name="MY DATA FRAME",
...     sep=';',
...     quotechar='"',
...     roles=roles
... )
If you think that typing out all of the roles by hand is too cumbersome, you can use a dry run:
>>> roles = data.DataFrame.from_csv(
...     fnames=["file1.csv", "file2.csv"],
...     name="MY DATA FRAME",
...     sep=';',
...     quotechar='"',
...     dry=True
... )
This will return the roles dictionary it would have used. You can now hard-code this.
-
classmethod from_db(table_name, name=None, roles=None, ignore=False, dry=False, conn=None)¶
Create a DataFrame from a table in a database.
It will construct a data frame object in the engine, fill it with the data read from the table table_name in the connected database (see database), and return a corresponding DataFrame handle.
- Parameters
table_name (str) – Name of the table to be read.
name (str) – Name of the data frame to be created. If not passed, then the table_name will be used.
roles (dict[str, List[str]], optional) –
A dictionary mapping the roles to the column names. If this is not passed, then the roles will be sniffed from the table. The roles dictionary should be in the following format:
>>> roles = {"role1": ["colname1", "colname2"], "role2": ["colname3"]}
ignore (bool, optional) – Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)?
dry (bool, optional) – If set to True, then the data will not actually be read. Instead, the method will only return the roles it would have used. This can be used to hard-code roles when setting up a pipeline.
conn (Connection, optional) – The database connection to be used. If you don't explicitly pass a connection, the engine will use the default connection.
- Raises
TypeError – If any of the input arguments is of a wrong type.
ValueError – If one of the provided keys in roles does not match a definition in roles.
- Returns
Handler of the underlying data.
- Return type
DataFrame
Note
The created data frame object is only held in memory by the getML engine. If you want to use it in later sessions or after switching the project, you have to call the save() method.
In addition to reading data from a table, you can also write an existing DataFrame back into a new one in the same database using to_db(), or replace/append to the current instance using the read_db() or read_query() methods.
Example
getml.database.connect_mysql(
    host="relational.fit.cvut.cz",
    port=3306,
    dbname="financial",
    user="guest",
    password="relational"
)

loan = getml.data.DataFrame.from_db(table_name='loan', name='data_frame_loan')
-
classmethod from_dict(data, name, roles=None, ignore=False, dry=False)¶
Create a new DataFrame from a dict.
- Parameters
data (dict) –
The dict containing the data. The data should be in the following format:
data = {'col1': [1.0, 2.0, 1.0], 'col2': ['A', 'B', 'C']}
name (str) – Name of the data frame to be created.
roles (dict[str, List[str]], optional) –
A dictionary mapping the roles to the column names. If this is not passed, then the roles will be sniffed from the data. The roles dictionary should be in the following format:
roles = {"role1": ["colname1", "colname2"], "role2": ["colname3"]}
ignore (bool, optional) – Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)?
dry (bool, optional) – If set to True, then the data will not actually be read. Instead, the method will only return the roles it would have used. This can be used to hard-code roles when setting up a pipeline.
- Raises
TypeError – If any of the input arguments is of a wrong type.
ValueError – If one of the provided keys in roles does not match a definition in roles.
- Returns
Handler of the underlying data.
- Return type
DataFrame
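Example
A minimal sketch of the call; the column names, roles, and frame name are placeholders:
data = {'col1': [1.0, 2.0, 1.0], 'col2': ['A', 'B', 'C']}

df = getml.data.DataFrame.from_dict(
    data,
    name="small_example",
    roles={"numerical": ["col1"], "categorical": ["col2"]}
)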
-
classmethod from_json(json_str, name, roles=None, ignore=False, dry=False)¶
Create a new DataFrame from a JSON string.
It will construct a data frame object in the engine, fill it with the data read from the JSON string, and return a corresponding DataFrame handle.
- Parameters
json_str (str) –
The JSON string containing the data. The json_str should be in the following format:
json_str = '{"col1": [1.0, 2.0, 1.0], "col2": ["A", "B", "C"]}'
name (str) – Name of the data frame to be created.
roles (dict[str, List[str]], optional) –
A dictionary mapping the roles to the column names. If this is not passed, then the roles will be sniffed from the string. The roles dictionary should be in the following format:
roles = {"role1": ["colname1", "colname2"], "role2": ["colname3"]}
ignore (bool, optional) – Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)?
dry (bool, optional) – If set to True, then the data will not actually be read. Instead, the method will only return the roles it would have used. This can be used to hard-code roles when setting up a pipeline.
- Raises
TypeError – If any of the input arguments is of a wrong type.
ValueError – If one of the provided keys in roles does not match a definition in roles.
- Returns
Handler of the underlying data.
- Return type
DataFrame
Note
The created data frame object is only held in memory by the getML engine. If you want to use it in later sessions or after switching the project, you have to call the save() method.
In addition to reading data from a JSON string, you can also write an existing DataFrame back into one using to_json(), or replace/append to the current instance using the read_json() method.
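Example
A minimal sketch; the column names and the frame name are placeholders:
json_str = '{"col1": [1.0, 2.0, 1.0], "col2": ["A", "B", "C"]}'

df = getml.data.DataFrame.from_json(json_str, name="from_json_example")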
-
classmethod from_pandas(pandas_df, name, roles=None, ignore=False, dry=False)¶
Create a DataFrame from a pandas.DataFrame.
It will construct a data frame object in the engine, fill it with the data read from the pandas.DataFrame, and return a corresponding DataFrame handle.
- Parameters
pandas_df (pandas.DataFrame) – The table to be read.
name (str) – Name of the data frame to be created.
roles (dict[str, List[str]], optional) –
A dictionary mapping the roles to the column names. If this is not passed, then the roles will be sniffed from the pandas.DataFrame. The roles dictionary should be in the following format:
roles = {"role1": ["colname1", "colname2"], "role2": ["colname3"]}
ignore (bool, optional) – Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)?
dry (bool, optional) – If set to True, then the data will not actually be read. Instead, the method will only return the roles it would have used. This can be used to hard-code roles when setting up a pipeline.
- Raises
TypeError – If any of the input arguments is of a wrong type.
ValueError – If one of the provided keys in roles does not match a definition in roles.
- Returns
Handler of the underlying data.
- Return type
DataFrame
Note
The created data frame object is only held in memory by the getML engine. If you want to use it in later sessions or after switching the project, you have to call the save() method.
In addition to reading data from a pandas.DataFrame, you can also write an existing DataFrame back into one using to_pandas(), or replace/append to the current instance using the read_pandas() method.
-
classmethod from_s3(bucket, keys, region, name, num_lines_sniffed=1000, num_lines_read=0, sep=',', skip=0, colnames=None, roles=None, ignore=False, dry=False)¶
Create a DataFrame from CSV files located in an S3 bucket.
NOTE THAT S3 IS NOT SUPPORTED ON WINDOWS.
This classmethod will construct a data frame object in the engine, fill it with the data read from the CSV file(s), and return a corresponding DataFrame handle.
- Parameters
bucket (str) – The bucket from which to read the files.
keys (List[str]) – The list of keys (files in the bucket) to be read.
region (str) – The region in which the bucket is located.
name (str) – Name of the data frame to be created.
num_lines_sniffed (int, optional) – Number of lines analyzed by the sniffer.
num_lines_read (int, optional) – Number of lines read from each file. Set to 0 to read in the entire file.
sep (str, optional) – The separator used for separating fields.
skip (int, optional) – Number of lines to skip at the beginning of each file.
colnames (List[str] or None, optional) – The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them.
roles (dict[str, List[str]], optional) –
A dictionary mapping the roles to the column names. If this is not passed, then the roles will be sniffed from the CSV files. The roles dictionary should be in the following format:
>>> roles = {"role1": ["colname1", "colname2"], "role2": ["colname3"]}
ignore (bool, optional) – Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)?
dry (bool, optional) – If set to True, then the data will not actually be read. Instead, the method will only return the roles it would have used. This can be used to hard-code roles when setting up a pipeline.
- Raises
TypeError – If any of the input arguments is of a wrong type.
ValueError – If one of the provided keys in roles does not match a definition in roles.
- Returns
Handler of the underlying data.
- Return type
DataFrame
Note
The created data frame object is only held in memory by the getML engine. If you want to use it in later sessions or after switching the project, you have to call the save() method.
It is assumed that the first line of each CSV file contains a header with the column names.
Example
Let’s assume you have two CSV files - file1.csv and file2.csv - in the bucket. You can import their data into the getML engine using the following commands:
>>> getml.engine.set_s3_access_key_id("YOUR-ACCESS-KEY-ID")
>>>
>>> getml.engine.set_s3_secret_access_key("YOUR-SECRET-ACCESS-KEY")
>>>
>>> data_frame_expd = data.DataFrame.from_s3(
...     bucket="your-bucket-name",
...     keys=["file1.csv", "file2.csv"],
...     region="us-east-2",
...     name="MY DATA FRAME",
...     sep=';'
... )
You can also set the access credentials as environment variables before you launch the getML engine.
Also refer to the documentation on from_csv() for further information on overriding the CSV sniffer for greater type safety.
-
group_by(key, name, aggregations)¶
Creates a new DataFrame by grouping over a join key.
This function splits the DataFrame into groups with the same value for key, applies an aggregation function to one or more columns in each group, and combines the results into a new DataFrame. The aggregation function is defined for each column individually, which allows applying different aggregations to each column. In pandas this is known as named aggregation.
- Parameters
key (str) – Name of the key to group by. If the key is a join key, the group_by will be faster, because join keys already have an index, whereas all other columns need to have an index built for the group_by.
name (str) – Name of the new DataFrame.
aggregations (List[Aggregation]) – Methods to apply on the groupings.
- Raises
TypeError – If any of the input arguments is of wrong type.
- Returns
Handler of the newly generated data frame object.
- Return type
DataFrame
Examples
Generate example data:
data = dict(
    fruit=["banana", "apple", "cherry", "cherry", "melon", "pineapple"],
    price=[2.4, 3.0, 1.2, 1.4, 3.4, 3.4],
    join_key=["0", "1", "2", "2", "3", "3"]
)

df = getml.data.DataFrame.from_dict(
    data,
    name="fruits",
    roles={"categorical": ["fruit"], "join_key": ["join_key"], "numerical": ["price"]}
)

df
| join_key | fruit       | price     |
| join key | categorical | numerical |
--------------------------------------
| 0        | banana      | 2.4       |
| 1        | apple       | 3         |
| 2        | cherry      | 1.2       |
| 2        | cherry      | 1.4       |
| 3        | melon       | 3.4       |
| 3        | pineapple   | 3.4       |
Group the DataFrame using join_key. Aggregate the resulting groups by averaging and summing over the price column and counting the distinct entries in the fruit column:
df_grouped = df.group_by(
    "join_key",
    "fruits_grouped",
    [df["price"].avg(alias="avg price"),
     df["price"].sum(alias="total price"),
     df["fruit"].count_distinct(alias="unique items")]
)

df_grouped
| join_key | avg price | total price | unique items |
| join key | unused    | unused      | unused       |
-----------------------------------------------------
| 3        | 3.4       | 6.8         | 2            |
| 2        | 1.3       | 2.6         | 1            |
| 0        | 2.4       | 2.4         | 1            |
| 1        | 3         | 3           | 1            |
-
join(name, other, join_key, other_join_key=None, cols=None, other_cols=None, how='inner', where=None)¶
Create a new DataFrame by joining the current instance with another DataFrame.
- Parameters
name (str) – The name of the new DataFrame.
other (DataFrame) – The other DataFrame to be joined with the current instance.
join_key (str) – Name of the column containing the join key in the current instance.
other_join_key (str, optional) – Name of the join key in the other DataFrame. If set to None, join_key will be used for both the current instance and other.
cols (List[Union[FloatColumn, StringColumn]], optional) – columns in the current instance to be included in the resulting DataFrame. If set to None, all columns will be used.
other_cols (List[Union[FloatColumn, StringColumn]], optional) – columns in other to be included in the resulting DataFrame. If set to None, all columns will be used.
how (str, optional) –
Type of the join.
Supported options:
'left'
'inner'
'right'
where (VirtualBooleanColumn, optional) – Boolean column indicating which rows are to be included in the resulting DataFrame. If set to None, all rows will be used. It imposes a SQL-like WHERE condition on the join.
- Raises
TypeError – If any of the input arguments is of wrong type.
- Returns
Handler of the newly created data frame object.
- Return type
DataFrame
Examples
Create the first DataFrame:
data_df = dict(
    colors=["blue", "green", "yellow", "orange"],
    numbers=[2.4, 3.0, 1.2, 1.4],
    join_key=["0", "1", "2", "3"]
)

df = getml.data.DataFrame.from_dict(
    data_df,
    name="df_1",
    roles=dict(join_key=["join_key"], numerical=["numbers"], categorical=["colors"])
)

df
| join_key | colors      | numbers   |
| join key | categorical | numerical |
--------------------------------------
| 0        | blue        | 2.4       |
| 1        | green       | 3         |
| 2        | yellow      | 1.2       |
| 3        | orange      | 1.4       |
Create the other DataFrame:
data_other = dict(
    colors=["blue", "green", "yellow", "black", "orange", "white"],
    numbers=[2.4, 3.0, 1.2, 1.4, 3.4, 2.2],
    join_key=["0", "1", "2", "2", "3", "4"]
)

other = getml.data.DataFrame.from_dict(
    data_other,
    name="df_2",
    roles=dict(join_key=["join_key"], numerical=["numbers"], categorical=["colors"])
)

other
| join_key | colors      | numbers   |
| join key | categorical | numerical |
--------------------------------------
| 0        | blue        | 2.4       |
| 1        | green       | 3         |
| 2        | yellow      | 1.2       |
| 2        | black       | 1.4       |
| 3        | orange      | 3.4       |
| 4        | white       | 2.2       |
Left join the two DataFrames on their join key, keeping the columns 'colors' and 'numbers' from the first one and the column 'colors' as 'other_color' from the second one. As a subcondition, only rows where the 'numbers' columns are equal are selected:
joined_df = df.join(
    name="joined_df",
    other=other,
    how="left",
    join_key="join_key",
    cols=[df["colors"], df["numbers"]],
    other_cols=[other["colors"].alias("other_color")],
    where=(df["numbers"] == other["numbers"])
)

joined_df
| colors      | other_color | numbers   |
| categorical | categorical | numerical |
-----------------------------------------
| blue        | blue        | 2.4       |
| green       | green       | 3         |
| yellow      | yellow      | 1.2       |
-
load()¶
Loads saved data from disk.
The data frame object holding the same name as the current DataFrame instance will be loaded from disk into the getML engine, and the current handler will be updated using refresh().
Examples
First, we have to create and import data sets:
d, _ = getml.datasets.make_numerical(population_name='test')

getml.data.list_data_frames()
In the output of list_data_frames() we can find our underlying data frame object 'test' listed under the 'in_memory' key (it was created and imported by make_numerical()). This means the getML engine only holds it in memory (RAM) so far, and we still have to save() it to disk in order to load() it again or to prevent any loss of information between different sessions.
d.save()

getml.data.list_data_frames()

d2 = getml.data.DataFrame(name='test').load()
- Returns
Updated handle to the underlying data frame in the getML engine.
- Return type
DataFrame
Note
When invoking load(), all changes to the underlying data frame object that took place after the last call to save() will be lost. This method thus enables you to undo changes applied to the DataFrame.
d, _ = getml.datasets.make_numerical()
d.save()

# Accidental change we want to undo
d.rm('column_01')

d.load()
If save() hasn't been called on the current instance yet, or it wasn't stored to disk in a previous session, load() will throw an exception:
File or directory '../projects/X/data/Y/' not found!
Alternatively, load_data_frame() offers an easier way of creating DataFrame handlers to data in the getML engine.
-
n_bytes()¶
Size of the data stored in the underlying data frame in the getML engine.
- Raises
Exception – If the data frame corresponding to the current instance could not be found in the getML engine.
- Returns
Size of the underlying object in bytes.
- Return type
numpy.uint64
-
n_cols()¶
Number of columns in the current instance.
- Returns
Overall number of columns
- Return type
int
-
n_rows()¶
Number of rows in the current instance.
- Raises
Exception – If the data frame corresponding to the current instance could not be found in the getML engine.
- Returns
Overall number of rows
- Return type
numpy.int32
-
num_column(value)¶
Generates a float or integer column that consists solely of a single entry.
- Parameters
value (float) – The value to be used.
- Raises
TypeError – If any of the input arguments is of wrong type.
- Returns
FloatColumn consisting of the singular entry.
- Return type
FloatColumn
-
random(seed=5849)¶
Creates a random column.
The numbers will be uniformly distributed from 0.0 to 1.0. This can be used to randomly split a population table into a training and a test set.
- Parameters
seed (int) – Seed used for the random number generator.
- Returns
FloatColumn containing random numbers
- Return type
FloatColumn
Example
population = getml.data.DataFrame('population')

population.add(numpy.zeros(100), 'column_01')

print(len(population))
100
idx = population.random(seed=42)

population_train = population.where("population_train", idx > 0.7)
population_test = population.where("population_test", idx <= 0.7)

print(len(population_train), len(population_test))
27 73
-
read_csv(fnames, append=False, quotechar='"', sep=',', num_lines_read=0, skip=0, colnames=None, time_formats=None)¶
Read CSV files.
It is assumed that the first line of each CSV file contains a header with the column names.
- Parameters
fnames (List[str]) – CSV file paths to be read.
append (bool, optional) – If a data frame object holding the same name is already present in the getML engine, should the content of the CSV files in fnames be appended or replace the existing data?
quotechar (str, optional) – The character used to wrap strings.
sep (str, optional) – The separator used for separating fields.
num_lines_read (int, optional) – Number of lines read from each file. Set to 0 to read in the entire file.
skip (int, optional) – Number of lines to skip at the beginning of each file.
colnames (List[str] or None, optional) – The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them.
time_formats (List[str], optional) – The list of formats tried when parsing time stamps.
The formats are allowed to contain the following special characters:
%w - abbreviated weekday (Mon, Tue, …)
%W - full weekday (Monday, Tuesday, …)
%b - abbreviated month (Jan, Feb, …)
%B - full month (January, February, …)
%d - zero-padded day of month (01 .. 31)
%e - day of month (1 .. 31)
%f - space-padded day of month ( 1 .. 31)
%m - zero-padded month (01 .. 12)
%n - month (1 .. 12)
%o - space-padded month ( 1 .. 12)
%y - year without century (70)
%Y - year with century (1970)
%H - hour (00 .. 23)
%h - hour (00 .. 12)
%a - am/pm
%A - AM/PM
%M - minute (00 .. 59)
%S - second (00 .. 59)
%s - seconds and microseconds (equivalent to %S.%F)
%i - millisecond (000 .. 999)
%c - centisecond (0 .. 9)
%F - fractional seconds/microseconds (000000 - 999999)
%z - time zone differential in ISO 8601 format (Z or +NN.NN)
%Z - time zone differential in RFC format (GMT or +NNNN)
%% - percent sign
- Raises
TypeError – If any of the input arguments is of wrong type.
- Returns
Handler of the underlying data.
- Return type
DataFrame
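Example
A minimal sketch, assuming df is an existing DataFrame handler and file3.csv has the same column layout as the data already in the engine:
# Append the rows of another CSV file instead of replacing the
# existing data.
df.read_csv(fnames=["file3.csv"], append=True)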
-
read_db(table_name, append=False, conn=None)¶
Fill from Database.
The DataFrame will be filled from a table in the database.
- Parameters
table_name (str) – Table from which we want to retrieve the data.
append (bool, optional) – If a data frame object holding the same name is already present in the getML engine, should the content of table_name be appended or replace the existing data?
conn (Connection, optional) – The database connection to be used. If you don't explicitly pass a connection, the engine will use the default connection.
- Raises
TypeError – If any of the input arguments is of wrong type.
- Returns
Handler of the underlying data.
- Return type
DataFrame
-
read_json(json_str, append=False, time_formats=None)¶
Fill from JSON.
Fills the data frame with data from a JSON string.
- Parameters
json_str (str) – The JSON string containing the data.
append (bool, optional) – If a data frame object holding the same name is already present in the getML engine, should the content of json_str be appended or replace the existing data?
time_formats (List[str], optional) – The list of formats tried when parsing time stamps.
The formats are allowed to contain the following special characters:
%w - abbreviated weekday (Mon, Tue, …)
%W - full weekday (Monday, Tuesday, …)
%b - abbreviated month (Jan, Feb, …)
%B - full month (January, February, …)
%d - zero-padded day of month (01 .. 31)
%e - day of month (1 .. 31)
%f - space-padded day of month ( 1 .. 31)
%m - zero-padded month (01 .. 12)
%n - month (1 .. 12)
%o - space-padded month ( 1 .. 12)
%y - year without century (70)
%Y - year with century (1970)
%H - hour (00 .. 23)
%h - hour (00 .. 12)
%a - am/pm
%A - AM/PM
%M - minute (00 .. 59)
%S - second (00 .. 59)
%s - seconds and microseconds (equivalent to %S.%F)
%i - millisecond (000 .. 999)
%c - centisecond (0 .. 9)
%F - fractional seconds/microseconds (000000 - 999999)
%z - time zone differential in ISO 8601 format (Z or +NN.NN)
%Z - time zone differential in RFC format (GMT or +NNNN)
%% - percent sign
- Raises
TypeError – If any of the input arguments is of wrong type.
- Returns
Handler of the underlying data.
- Return type
DataFrame
-
read_pandas(pandas_df, append=False)¶
Uploads a pandas.DataFrame.
Replaces the actual content of the underlying data frame in the getML engine with pandas_df.
- Parameters
pandas_df (pandas.DataFrame) – Data the underlying data frame object in the getML engine should obtain.
append (bool, optional) – If a data frame object holding the same name is already present in the getML engine, should the content of pandas_df be appended or replace the existing data?
- Raises
TypeError – If any of the input arguments is of wrong type.
- Returns
Current instance.
- Return type
DataFrame
Note
For columns containing pandas.Timestamp, small inconsistencies on the order of microseconds can occur when sending the data to the getML engine. This is due to the way the underlying information is stored.
-
read_query(query, append=False, conn=None)¶
Fill from query.
Fills the data frame with the data returned by a query on the database.
- Parameters
query (str) – The query used to retrieve the data.
append (bool, optional) – If a data frame object holding the same name is already present in the getML engine, should the result of query be appended or replace the existing data?
conn (Connection, optional) – The database connection to be used. If you don't explicitly pass a connection, the engine will use the default connection.
- Raises
TypeError – If any of the input arguments is of wrong type.
- Returns
Handler of the underlying data.
- Return type
DataFrame
-
read_s3(bucket, keys, region, append=False, sep=',', num_lines_read=0, skip=0, colnames=None, time_formats=None)¶
Read CSV files from an S3 bucket.
NOTE THAT S3 IS NOT SUPPORTED ON WINDOWS.
It is assumed that the first line of each CSV file contains a header with the column names.
- Parameters
bucket (str) – The bucket from which to read the files.
keys (List[str]) – The list of keys (files in the bucket) to be read.
region (str) – The region in which the bucket is located.
append (bool, optional) – If a data frame object holding the same name is already present in the getML engine, should the content of the CSV files denoted by keys be appended or replace the existing data?
sep (str, optional) – The separator used for separating fields.
num_lines_read (int, optional) – Number of lines read from each file. Set to 0 to read in the entire file.
skip (int, optional) – Number of lines to skip at the beginning of each file.
colnames (List[str] or None, optional) – The first line of a CSV file usually contains the column names. When this is not the case, you need to explicitly pass them.
time_formats (List[str], optional) – The list of formats tried when parsing time stamps.
The formats are allowed to contain the following special characters:
%w - abbreviated weekday (Mon, Tue, …)
%W - full weekday (Monday, Tuesday, …)
%b - abbreviated month (Jan, Feb, …)
%B - full month (January, February, …)
%d - zero-padded day of month (01 .. 31)
%e - day of month (1 .. 31)
%f - space-padded day of month ( 1 .. 31)
%m - zero-padded month (01 .. 12)
%n - month (1 .. 12)
%o - space-padded month ( 1 .. 12)
%y - year without century (70)
%Y - year with century (1970)
%H - hour (00 .. 23)
%h - hour (00 .. 12)
%a - am/pm
%A - AM/PM
%M - minute (00 .. 59)
%S - second (00 .. 59)
%s - seconds and microseconds (equivalent to %S.%F)
%i - millisecond (000 .. 999)
%c - centisecond (0 .. 9)
%F - fractional seconds/microseconds (000000 - 999999)
%z - time zone differential in ISO 8601 format (Z or +NN.NN)
%Z - time zone differential in RFC format (GMT or +NNNN)
%% - percent sign
- Raises
TypeError – If any of the input arguments is of wrong type.
- Returns
Handler of the underlying data.
- Return type
DataFrame
-
refresh()¶
Aligns meta-information of the current instance with the corresponding data frame in the getML engine.
This method can be used to avoid encoding conflicts. Note that load(), as well as several other methods, automatically calls refresh().
-
rowid()¶
Get the row numbers of the table.
- Returns
A (numerical) column containing the row id, starting with 0.
- Return type
FloatColumn
-
save()¶
Writes the underlying data in the getML engine to disk.
To be stored persistently, the corresponding data frame object in the getML engine has to be created already (via send()).
- Returns
The current instance.
- Return type
DataFrame
-
set_role(names, role, time_formats=None)¶
Assigns a new role to one or more columns.
When switching from a role based on type float to a role based on type string or vice versa, an implicit type conversion will be conducted. The time_formats argument is used to interpret the time format strings. For more information on roles, please refer to the user guide.
- Parameters
names (str or List[str]) – The name or names of the column.
role (str) – The role to be assigned.
time_formats (str or List[str], optional) – Formats to be used to parse the time stamps. This is only necessary, if an implicit conversion from a StringColumn to a time stamp is taking place.
- Raises
TypeError – If any of the input arguments has a wrong type.
ValueError – If one of the provided names does not correspond to an existing column.
Example
data_df = dict(
    animal=["hawk", "parrot", "goose"],
    votes=[12341, 5127, 65311],
    date=["04/06/2019", "01/03/2019", "24/12/2018"]
)

df = getml.data.DataFrame.from_dict(data_df, "animal_elections")

df.set_role(['animal'], getml.data.roles.categorical)
df.set_role(['votes'], getml.data.roles.numerical)
df.set_role(['date'], getml.data.roles.time_stamp, time_formats=['%d/%m/%Y'])

df
| date                        | animal      | votes     |
| time stamp                  | categorical | numerical |
---------------------------------------------------------
| 2019-06-04T00:00:00.000000Z | hawk        | 12341     |
| 2019-03-01T00:00:00.000000Z | parrot      | 5127      |
| 2018-12-24T00:00:00.000000Z | goose       | 65311     |
-
set_unit(names, unit, comparison_only=False)¶
Assigns a new unit to one or more columns.
- Parameters
names (str or List[str]) – The name or names of the column.
unit (str) – The unit to be assigned.
comparison_only (bool) –
Whether you want the column to be used for comparison only. This means that the column can only be used in comparison to other columns of the same unit.
An example might be a bank account number: The number in itself is hardly interesting, but it might be useful to know how often we have seen that same bank account number in another table.
If True, this will append “, comparison only” to the unit. The feature learning algorithms and the feature selectors will interpret this accordingly.
- Raises
TypeError – If any of the input arguments is of wrong type.
ValueError – If one of the provided names does not correspond to an existing column.
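Example
A minimal sketch, assuming a data frame df with a column 'account_id' (both are placeholder names):
# Account numbers carry little meaning on their own, so restrict
# the column to comparisons with columns of the same unit.
df.set_unit(['account_id'], 'account number', comparison_only=True)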
-
string_column(value)¶
Generates a string column that consists solely of a single entry.
- Parameters
value (str) – The value to be used.
- Returns
Column consisting of the singular entry.
- Return type
StringColumn
- Raises
TypeError – If any of the input arguments is of wrong type.
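Example
A minimal sketch of the constant-column helpers num_column() and string_column(), assuming an existing DataFrame handler df:
# Each column repeats a single value for every row of df.
ones = df.num_column(1.0)
split = df.string_column("train")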
-
to_csv(fname, quotechar='"', sep=',', batch_size=0)¶
Writes the underlying data into a newly created CSV file.
- Parameters
fname (str) – The name of the CSV file. The ending “.csv” and an optional batch number will be added automatically.
quotechar (str, optional) – The character used to wrap strings.
sep (str, optional) – The character used for separating fields.
batch_size (int, optional) – Maximum number of lines per file. Set to 0 to write the entire data frame into a single file.
- Raises
TypeError – If any of the input arguments is of wrong type.
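Example
A minimal sketch, assuming df is an existing DataFrame handler; the file name is a placeholder:
# Writes the data to 'my_data.csv'; the '.csv' ending is added
# automatically.
df.to_csv("my_data")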
-
to_db(table_name, conn=None)¶
Writes the underlying data into a newly created table in the database.
- Parameters
table_name (str) – Name of the table to be created. If a table of that name already exists, it will be replaced.
conn (Connection, optional) – The database connection to be used. If you don't explicitly pass a connection, the engine will use the default connection.
- Raises
TypeError – If any of the input arguments is of wrong type.
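Example
A minimal sketch, assuming a database connection has been established (for instance via getml.database.connect_mysql(...)) and df is an existing DataFrame handler; the table name is a placeholder:
# Writes df into a new table, replacing any existing table of
# the same name.
df.to_db("my_table")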
-
to_html(max_rows=5)¶
Represents the data frame in HTML format, optimized for an iPython notebook.
- Parameters
max_rows (int) – The maximum number of rows to be displayed.
-
to_json()¶
Creates a JSON string from the current instance.
Loads the underlying data from the getML engine and constructs a JSON string.
- Returns
JSON string containing the names of the columns of the current instance as keys and their corresponding data as values.
- Return type
str
-
to_pandas()¶
Creates a pandas.DataFrame from the current instance.
Loads the underlying data from the getML engine and constructs a pandas.DataFrame.
- Returns
Pandas equivalent of the current instance including its underlying data.
- Return type
pandas.DataFrame
-
to_placeholder()¶
Generates a Placeholder from the current DataFrame.
- Returns
A placeholder with the same name as this data frame.
- Return type
Placeholder
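Example
A sketch of how placeholders are typically used to define the data model; population and peripheral are assumed to be existing DataFrame handlers, and 'join_key' is a placeholder column name (see the Placeholder documentation for the join signature):
population_ph = population.to_placeholder()
peripheral_ph = peripheral.to_placeholder()

# Express the relationship between the two tables on the
# placeholder level.
population_ph.join(peripheral_ph, join_key="join_key")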
-
to_s3(bucket, key, region, sep=',', batch_size=50000)¶
Writes the underlying data into a newly created CSV file located in an S3 bucket.
NOTE THAT S3 IS NOT SUPPORTED ON WINDOWS.
- Parameters
bucket (str) – The bucket from which to read the files.
key (str) – The key in the S3 bucket in which you want to write the output. The ending “.csv” and an optional batch number will be added automatically.
region (str) – The region in which the bucket is located.
sep (str, optional) – The character used for separating fields.
batch_size (int, optional) – Maximum number of lines per file. Set to 0 to write the entire data frame into a single file.
- Raises
TypeError – If any of the input arguments is of wrong type.
Example
>>> getml.engine.set_s3_access_key_id("YOUR-ACCESS-KEY-ID")
>>>
>>> getml.engine.set_s3_secret_access_key("YOUR-SECRET-ACCESS-KEY")
>>>
>>> df_expd.to_s3(
...     bucket="your-bucket-name",
...     key="filename-on-s3",
...     region="us-east-2",
...     sep=';'
... )
-
where(name, condition)¶
Extract a subset of rows.
Creates a new DataFrame as a subselection of the current instance. Internally, it creates a new data frame object in the getML engine containing only a subset of rows of the original one and returns a handler to this new object.
- Parameters
name (str) – Name of the new, resulting DataFrame.
condition (VirtualBooleanColumn) – Boolean column indicating the rows you want to select.
- Raises
TypeError – If any of the input arguments is of wrong type.
- Returns
Handler of the newly created data frame containing just a subset of rows of the current instance.
- Return type
DataFrame
Example
Generate example data:
data = dict(
    fruit=["banana", "apple", "cherry", "cherry", "melon", "pineapple"],
    price=[2.4, 3.0, 1.2, 1.4, 3.4, 3.4],
    join_key=["0", "1", "2", "2", "3", "3"]
)

fruits = getml.data.DataFrame.from_dict(
    data,
    name="fruits",
    roles={"categorical": ["fruit"], "join_key": ["join_key"], "numerical": ["price"]}
)

fruits
| join_key | fruit       | price     |
| join key | categorical | numerical |
--------------------------------------
| 0        | banana      | 2.4       |
| 1        | apple       | 3         |
| 2        | cherry      | 1.2       |
| 2        | cherry      | 1.4       |
| 3        | melon       | 3.4       |
| 3        | pineapple   | 3.4       |
Apply where condition. This creates a new DataFrame called “cherries”:
cherries = fruits.where(
    name="cherries",
    condition=(fruits["fruit"] == "cherry")
)

cherries
| join_key | fruit       | price     |
| join key | categorical | numerical |
--------------------------------------
| 2        | cherry      | 1.2       |
| 2        | cherry      | 1.4       |