DataFrame¶
class getml.data.DataFrame(name, roles=None)¶
Bases: object
Handler for the data stored in the getML engine.
The DataFrame class represents a data frame object in the getML engine but does not contain any actual data itself. To create such a data frame object, fill it with data via the Python API, and retrieve a handler for it, you can use one of the from_csv(), from_db(), from_json(), or from_pandas() class methods. The Uploading data section in the user guide explains the particularities of each of these flavors of the unified import interface in detail.
In case the data frame object is already present in the engine - either in memory as a temporary object or on disk when save() was called earlier - the load_data_frame() function will create a new handler without altering the underlying data. For more information about the lifecycle of the data in the getML engine and its synchronization with the Python API, please see the corresponding user guide.
- Parameters
name (str) – Unique identifier used to link the handler with the underlying data frame object in the engine.
roles (dict[str, List[str]], optional) –
A dictionary mapping the roles to the column names (see colnames()). The roles dictionary is expected to have the following format:
roles = {getml.data.role.numeric: ["colname1", "colname2"],
         getml.data.role.target: ["colname3"]}
- Raises
TypeError – If any of the input arguments is of wrong type.
ValueError – If one of the provided keys in roles does not match a definition in roles.
Examples
Creating a new data frame object in the getML engine and uploading data is done by one of the class methods from_csv(), from_db(), from_json(), or from_pandas().

random = numpy.random.RandomState(7263)
table = pandas.DataFrame()
table['column_01'] = random.randint(0, 10, 1000).astype(str)
table['join_key'] = numpy.arange(1000)
table['time_stamp'] = random.rand(1000)
table['target'] = random.rand(1000)
df_table = getml.data.DataFrame.from_pandas(table, name='table')
In addition to creating a new data frame object in the getML engine and filling it with all the content of table, the from_pandas() function also returns a DataFrame handler to the underlying data.
You don't have to create the data frame objects anew for each session. You can use their save() method to write them to disk, the list_data_frames() function to list all available objects in the engine, and load_data_frame() to create a DataFrame handler for a data set already present in the getML engine (see Lifecycles and synchronization between engine and API for details).

df_table.save()
getml.data.list_data_frames()
df_table_reloaded = getml.data.load_data_frame('table')
Note
Although the Python API does not store the actual data itself, you can use the to_csv(), to_db(), to_json(), and to_pandas() methods to retrieve it.
Attributes Summary
List of the names of all categorical columns.
List of the names of all columns.
List of the names of all join keys.
Number of categorical columns.
Number of join keys.
Number of numerical columns.
Number of target columns.
Number of time stamp columns.
Number of unused columns.
Number of unused float columns.
Number of unused string columns.
List of the names of all numerical columns.
A tuple containing the number of rows and columns of the DataFrame.
List of the names of all target columns.
List of the names of all time stamps.
List of the names of all unused float columns.
List of the names of all unused columns.
List of the names of all unused string columns.
Methods Summary
add(col, name[, role, unit, time_formats])
    Adds a column to the current DataFrame.
delete([mem_only])
    Deletes the data frame from the getML engine.
from_csv(fnames, name[, num_lines_sniffed, …])
    Create a DataFrame from CSV files.
from_db(table_name, name[, roles, ignore, dry])
    Create a DataFrame from a table in a database.
from_dict(data, name[, roles, ignore, dry])
    Create a new DataFrame from a dict.
from_json(json_str, name[, roles, ignore, dry])
    Create a new DataFrame from a JSON string.
from_pandas(pandas_df, name[, roles, …])
    Create a DataFrame from a pandas.DataFrame.
group_by(join_key, name, aggregations)
    Creates a new DataFrame by grouping over a join key.
join(name, other, join_key[, …])
    Create a new DataFrame by joining the current instance with another DataFrame.
load()
    Loads saved data from disk.
n_bytes()
    Size of the data stored in the underlying data frame in the getML engine.
n_cols()
    Number of columns in the current instance.
n_rows()
    Number of rows in the current instance.
num_column(value)
    Generates a float or integer column that consists solely of a single entry.
random([seed])
    Create a random column.
read_csv(fnames[, append, quotechar, sep, …])
    Read CSV files.
read_db(table_name[, append])
    Fill from database.
read_json(json_str[, append, time_formats])
    Fill from JSON.
read_pandas(pandas_df[, append])
    Uploads a pandas.DataFrame.
read_query(query[, append])
    Fill from query.
refresh()
    Aligns meta-information of the current instance with the corresponding data frame in the getML engine.
rm(name)
    Remove a column.
rowid()
    Get the row numbers of the table.
save()
    Writes the underlying data in the getML engine to disk.
set_role(names, role[, time_formats])
    Assigns a new role to one or more columns.
set_unit(names, unit[, comparison_only])
    Assigns a new unit to one or more columns.
string_column(value)
    Generates a string column that consists solely of a single entry.
to_csv(fname[, quotechar, sep])
    Writes the underlying data into a newly created CSV file.
to_db(table_name)
    Writes the underlying data into a newly created table in the database.
to_json()
    Creates a JSON string from the current instance.
to_pandas()
    Creates a pandas.DataFrame from the current instance.
to_placeholder()
    Generates a Placeholder from the current DataFrame.
where(name, condition)
    Extract a subset of rows.
Attributes Documentation
categorical_names¶
List of the names of all categorical columns.
- Returns
List of the names of all categorical columns.
- Return type
List[str]

colnames¶
List of the names of all columns.
- Returns
List of the names of all columns.
- Return type
List[str]

join_key_names¶
List of the names of all join keys.
- Returns
List of the names of all columns used as join keys.
- Return type
List[str]

n_categorical¶
Number of categorical columns.
- Returns
Number of categorical columns.
- Return type
int

n_join_keys¶
Number of join keys.
- Returns
Number of columns used as join keys.
- Return type
int

n_numerical¶
Number of numerical columns.
- Returns
Number of numerical columns.
- Return type
int

n_targets¶
Number of target columns.
- Returns
Number of columns used as targets.
- Return type
int

n_time_stamps¶
Number of time stamp columns.
- Returns
Number of columns used as time stamps.
- Return type
int

n_unused¶
Number of unused columns. Unused columns will not be used by the feature engineering algorithms.
- Returns
Number of columns that are unused.
- Return type
int

n_unused_floats¶
Number of unused float columns. Unused columns will not be used by the feature engineering algorithms.
- Returns
Number of float columns that are unused.
- Return type
int

n_unused_strings¶
Number of unused string columns. Unused columns will not be used by the feature engineering algorithms.
- Returns
Number of string columns that are unused.
- Return type
int

numerical_names¶
List of the names of all numerical columns.
- Returns
List of the names of all numerical columns.
- Return type
List[str]

shape¶
A tuple containing the number of rows and columns of the DataFrame.

target_names¶
List of the names of all target columns.
- Returns
List of the names of all columns used as targets.
- Return type
List[str]

time_stamp_names¶
List of the names of all time stamps.
- Returns
List of the names of all columns used as time stamps.
- Return type
List[str]

unused_float_names¶
List of the names of all unused float columns. Unused columns will not be used by the feature engineering algorithms.
- Returns
List of the names of all float columns that are unused.
- Return type
List[str]

unused_names¶
List of the names of all unused columns. Unused columns will not be used by the feature engineering algorithms.
- Returns
List of the names of all columns that are unused.
- Return type
List[str]

unused_string_names¶
List of the names of all unused string columns. Unused columns will not be used by the feature engineering algorithms.
- Returns
List of the names of all string columns that are unused.
- Return type
List[str]
Methods Documentation
add(col, name, role=None, unit='', time_formats=['%Y-%m-%dT%H:%M:%s%z', '%Y-%m-%d %H:%M:%S', '%Y-%m-%d'])¶
Adds a column to the current DataFrame.
- Parameters
col (column or numpy.ndarray) – The column or numpy.ndarray to be added.
name (str) – Name of the new column.
role (str, optional) –
Role of the new column. Must be one of the following:
unit (str, optional) – Unit of the column.
time_formats (List[str], optional) –
Formats to be used to parse the time stamps. This is only necessary if an implicit conversion from a StringColumn to a time stamp is taking place.
The formats are allowed to contain the following special characters:
%w - abbreviated weekday (Mon, Tue, …)
%W - full weekday (Monday, Tuesday, …)
%b - abbreviated month (Jan, Feb, …)
%B - full month (January, February, …)
%d - zero-padded day of month (01 .. 31)
%e - day of month (1 .. 31)
%f - space-padded day of month ( 1 .. 31)
%m - zero-padded month (01 .. 12)
%n - month (1 .. 12)
%o - space-padded month ( 1 .. 12)
%y - year without century (70)
%Y - year with century (1970)
%H - hour (00 .. 23)
%h - hour (00 .. 12)
%a - am/pm
%A - AM/PM
%M - minute (00 .. 59)
%S - second (00 .. 59)
%s - seconds and microseconds (equivalent to %S.%F)
%i - millisecond (000 .. 999)
%c - centisecond (0 .. 9)
%F - fractional seconds/microseconds (000000 - 999999)
%z - time zone differential in ISO 8601 format (Z or +NN.NN)
%Z - time zone differential in RFC format (GMT or +NNNN)
%% - percent sign
- Raises
TypeError – If any of the input arguments is of wrong type.
ValueError – If one of the provided keys in roles does not match a definition in roles.
delete(mem_only=False)¶
Deletes the data frame from the getML engine.
If called with the mem_only option set to True, the data frame corresponding to the handler represented by the current instance can be reloaded using the load() method.
- Parameters
mem_only (bool, optional) – If True, the data frame will not be deleted permanently but just from memory (RAM).
- Raises
TypeError – If any of the input arguments is of wrong type.
classmethod from_csv(fnames, name, num_lines_sniffed=1000, quotechar='"', sep=',', skip=0, roles=None, ignore=False, dry=False)¶
Create a DataFrame from CSV files.
The fastest way to import data into the getML engine is to read it directly from CSV files. This method will construct a data frame object in the engine, fill it with the data read from the CSV file(s), and return a corresponding DataFrame handle.
- Parameters
fnames (List[str]) – CSV file paths to be read.
name (str) – Name of the data frame to be created.
num_lines_sniffed (int, optional) – Number of lines analysed by the sniffer.
quotechar (str, optional) – The character used to wrap strings.
sep (str, optional) – The separator used for separating fields.
skip (int, optional) – Number of lines to skip at the beginning of each file.
roles (dict[str, List[str]], optional) –
A dictionary mapping the roles to the column names. If this is not passed, then the roles will be sniffed from the CSV files. The roles dictionary should be in the following format:
>>> roles = {"role1": ["colname1", "colname2"], "role2": ["colname3"]}
ignore (bool, optional) – Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)?
dry (bool, optional) – If set to True, then the data will not actually be read. Instead, the method will only return the roles it would have used. This can be used to hard-code roles when setting up a pipeline.
- Raises
TypeError – If any of the input arguments is of a wrong type.
ValueError – If one of the provided keys in roles does not match a definition in roles.
- Returns
Handler of the underlying data.
- Return type
DataFrame
Note
The created data frame object is only held in memory by the getML engine. If you want to use it in later sessions or after switching the project, you have to call its save() method.
It is assumed that the first line of each CSV file contains a header with the column names.
In addition to reading data from a CSV file, you can also write an existing DataFrame back into one using to_csv(), or replace/append to the current instance using the read_csv() method.
Examples
Let's assume you have two CSV files - file1.csv and file2.csv - in the current working directory. You can upload their data into the getML engine using:
>>> df_expd = data.DataFrame.from_csv(
...     fnames=["file1.csv", "file2.csv"],
...     name="MY DATA FRAME",
...     sep=';',
...     quotechar='"'
... )
However, the CSV format lacks type safety. If you want to build a reliable pipeline, it is a good idea to hard-code the roles:
>>> roles = {"categorical": ["colname1", "colname2"], "target": ["colname3"]}
>>>
>>> df_expd = data.DataFrame.from_csv(
...     fnames=["file1.csv", "file2.csv"],
...     name="MY DATA FRAME",
...     sep=';',
...     quotechar='"',
...     roles=roles
... )
If you think that typing out all of the roles by hand is too cumbersome, you can use a dry run:
>>> roles = data.DataFrame.from_csv(
...     fnames=["file1.csv", "file2.csv"],
...     name="MY DATA FRAME",
...     sep=';',
...     quotechar='"',
...     dry=True
... )
This will return the roles dictionary it would have used. You can now hard-code this.
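When hard-coding roles, a simple sanity check can catch typos before the engine raises a ValueError. The helper below is hypothetical (it is not part of the getML API), and the role names are an assumption taken from the attribute names on this page; check getml.data.roles for the authoritative set:

```python
# Hypothetical validation helper: ensure every key of a hard-coded roles
# dict is a known role name before passing it to from_csv().
KNOWN_ROLES = {
    "categorical", "join_key", "numerical", "target",
    "time_stamp", "unused_float", "unused_string",
}

def check_roles(roles):
    unknown = set(roles) - KNOWN_ROLES
    if unknown:
        raise ValueError(f"Unknown roles: {sorted(unknown)}")
    return roles

roles = {"categorical": ["colname1", "colname2"], "target": ["colname3"]}
check_roles(roles)  # passes silently
```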
classmethod from_db(table_name, name, roles=None, ignore=False, dry=False)¶
Create a DataFrame from a table in a database.
This method will construct a data frame object in the engine, fill it with the data read from table table_name in the connected database (see database), and return a corresponding DataFrame handle.
- Parameters
table_name (str) – Name of the table to be read.
name (str) – Name of the data frame to be created.
roles (dict[str, List[str]], optional) –
A dictionary mapping the roles to the column names. If this is not passed, then the roles will be sniffed from the table. The roles dictionary should be in the following format:
>>> roles = {"role1": ["colname1", "colname2"], "role2": ["colname3"]}
ignore (bool, optional) – Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)?
dry (bool, optional) – If set to True, then the data will not actually be read. Instead, the method will only return the roles it would have used. This can be used to hard-code roles when setting up a pipeline.
- Raises
TypeError – If any of the input arguments is of a wrong type.
ValueError – If one of the provided keys in roles does not match a definition in roles.
- Returns
Handler of the underlying data.
- Return type
DataFrame
Note
The created data frame object is only held in memory by the getML engine. If you want to use it in later sessions or after switching the project, you have to call its save() method.
In addition to reading data from a table, you can also write an existing DataFrame back into a new one in the same database using to_db(), or replace/append to the current instance using the read_db() or read_query() methods.
Example
getml.database.connect_mysql(
    host="relational.fit.cvut.cz",
    port=3306,
    dbname="financial",
    user="guest",
    password="relational"
)
loan = getml.data.DataFrame.from_db(table_name='loan', name='df_loan')
classmethod from_dict(data, name, roles=None, ignore=False, dry=False)¶
Create a new DataFrame from a dict.
- Parameters
data (dict) –
The dict containing the data. The data should be in the following format:
data = {'col1': [1.0, 2.0, 1.0], 'col2': ['A', 'B', 'C']}
name (str) – Name of the data frame to be created.
roles (dict[str, List[str]], optional) –
A dictionary mapping the roles to the column names. If this is not passed, then the roles will be sniffed from the data. The roles dictionary should be in the following format:
roles = {"role1": ["colname1", "colname2"], "role2": ["colname3"]}
ignore (bool, optional) – Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)?
dry (bool, optional) – If set to True, then the data will not actually be read. Instead, the method will only return the roles it would have used. This can be used to hard-code roles when setting up a pipeline.
- Raises
TypeError – If any of the input arguments is of a wrong type.
ValueError – If one of the provided keys in roles does not match a definition in roles.
- Returns
Handler of the underlying data.
- Return type
DataFrame
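Note that from_dict expects column-oriented data: one key per column, each mapping to a list of values. If your data is row-oriented (a list of records), it has to be transposed first; a minimal sketch:

```python
# Transpose row-oriented records into the column-oriented dict format
# that from_dict expects.
records = [
    {"col1": 1.0, "col2": "A"},
    {"col1": 2.0, "col2": "B"},
    {"col1": 1.0, "col2": "C"},
]

data = {key: [row[key] for row in records] for key in records[0]}
# data == {'col1': [1.0, 2.0, 1.0], 'col2': ['A', 'B', 'C']}
```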
classmethod from_json(json_str, name, roles=None, ignore=False, dry=False)¶
Create a new DataFrame from a JSON string.
This method will construct a data frame object in the engine, fill it with the data read from the JSON string, and return a corresponding DataFrame handle.
- Parameters
json_str (str) –
The JSON string containing the data. The json_str should be in the following format:
json_str = '{"col1": [1.0, 2.0, 1.0], "col2": ["A", "B", "C"]}'
name (str) – Name of the data frame to be created.
roles (dict[str, List[str]], optional) –
A dictionary mapping the roles to the column names. If this is not passed, then the roles will be sniffed from the string. The roles dictionary should be in the following format:
roles = {"role1": ["colname1", "colname2"], "role2": ["colname3"]}
ignore (bool, optional) – Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)?
dry (bool, optional) – If set to True, then the data will not actually be read. Instead, the method will only return the roles it would have used. This can be used to hard-code roles when setting up a pipeline.
- Raises
TypeError – If any of the input arguments is of a wrong type.
ValueError – If one of the provided keys in roles does not match a definition in roles.
- Returns
Handler of the underlying data.
- Return type
DataFrame
Note
The created data frame object is only held in memory by the getML engine. If you want to use it in later sessions or after switching the project, you have to call its save() method.
In addition to reading data from a JSON string, you can also write an existing DataFrame back into one using to_json(), or replace/append to the current instance using the read_json() method.
classmethod from_pandas(pandas_df, name, roles=None, ignore=False, dry=False)¶
Create a DataFrame from a pandas.DataFrame.
This method will construct a data frame object in the engine, fill it with the data read from the pandas.DataFrame, and return a corresponding DataFrame handle.
- Parameters
pandas_df (pandas.DataFrame) – The table to be read.
name (str) – Name of the data frame to be created.
roles (dict[str, List[str]], optional) –
A dictionary mapping the roles to the column names. If this is not passed, then the roles will be sniffed from the pandas.DataFrame. The roles dictionary should be in the following format:
roles = {"role1": ["colname1", "colname2"], "role2": ["colname3"]}
ignore (bool, optional) – Only relevant when roles is not None. Determines what you want to do with any colnames not mentioned in roles. Do you want to ignore them (True) or read them in as unused columns (False)?
dry (bool, optional) – If set to True, then the data will not actually be read. Instead, the method will only return the roles it would have used. This can be used to hard-code roles when setting up a pipeline.
- Raises
TypeError – If any of the input arguments is of a wrong type.
ValueError – If one of the provided keys in roles does not match a definition in roles.
- Returns
Handler of the underlying data.
- Return type
DataFrame
Note
The created data frame object is only held in memory by the getML engine. If you want to use it in later sessions or after switching the project, you have to call its save() method.
In addition to reading data from a pandas.DataFrame, you can also write an existing DataFrame back into one using to_pandas(), or replace/append to the current instance using the read_pandas() method.
group_by(join_key, name, aggregations)¶
Creates a new DataFrame by grouping over a join key.
This function splits the DataFrame into groups with the same value for join_key, applies an aggregation function to one or more columns in each group, and combines the results into a new DataFrame. The aggregation function is defined for each column individually, which allows applying different aggregations to each column. In pandas this is known as named aggregation.
- Parameters
join_key (str) – Name of the join key to group by.
name (str) – Name of the new DataFrame.
aggregations (List[_Aggregation]) – Methods to apply on the groupings.
- Raises
TypeError – If any of the input arguments is of wrong type.
- Returns
Handler of the newly generated data frame object.
- Return type
DataFrame
Examples
Generate example data
data = dict(
    fruit=["banana", "apple", "cherry", "cherry", "melon", "pineapple"],
    price=[2.4, 3.0, 1.2, 1.4, 3.4, 3.4],
    join_key=["0", "1", "2", "2", "3", "3"]
)
df = getml.data.DataFrame.from_dict(
    data,
    name="fruits",
    roles={"categorical": ["fruit"],
           "join_key": ["join_key"],
           "numerical": ["price"]}
)
df

| join_key | fruit       | price     |
| join key | categorical | numerical |
--------------------------------------
| 0        | banana      | 2.4       |
| 1        | apple       | 3         |
| 2        | cherry      | 1.2       |
| 2        | cherry      | 1.4       |
| 3        | melon       | 3.4       |
| 3        | pineapple   | 3.4       |
Group the DataFrame using join_key. Aggregate the resulting groups by averaging and summing over the price column and counting the distinct entries in the fruit column:
df_grouped = df.group_by(
    "join_key",
    "fruits_grouped",
    [df["price"].avg(alias="avg price"),
     df["price"].sum(alias="total price"),
     df["fruit"].count_distinct(alias="unique items")]
)
df_grouped

| join_key | avg price | total price | unique items |
| join key | unused    | unused      | unused       |
-----------------------------------------------------
| 3        | 3.4       | 6.8         | 2            |
| 2        | 1.3       | 2.6         | 1            |
| 0        | 2.4       | 2.4         | 1            |
| 1        | 3         | 3           | 1            |
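Conceptually, the named aggregation above behaves like the following plain-Python sketch over the same example data. This is illustrative only; the actual aggregation runs inside the getML engine:

```python
from collections import defaultdict

# Group rows by join_key, then apply a named aggregation per column:
# avg and sum over price, count_distinct over fruit.
data = {
    "fruit": ["banana", "apple", "cherry", "cherry", "melon", "pineapple"],
    "price": [2.4, 3.0, 1.2, 1.4, 3.4, 3.4],
    "join_key": ["0", "1", "2", "2", "3", "3"],
}

groups = defaultdict(list)
for fruit, price, key in zip(data["fruit"], data["price"], data["join_key"]):
    groups[key].append((fruit, price))

result = {
    key: {
        "avg price": sum(p for _, p in rows) / len(rows),
        "total price": sum(p for _, p in rows),
        "unique items": len({f for f, _ in rows}),
    }
    for key, rows in groups.items()
}
# e.g. result["3"] has avg price 3.4, total price 6.8, 2 unique items
```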
join(name, other, join_key, other_join_key=None, cols=None, other_cols=None, how='inner', where=None)¶
Create a new DataFrame by joining the current instance with another DataFrame.
- Parameters
name (str) – The name of the new DataFrame.
other (DataFrame) – The DataFrame to join the current instance with.
join_key (str) – Name of the column containing the join key in the current instance.
other_join_key (str, optional) – Name of the join key in the other DataFrame. If set to None, join_key will be used for both the current instance and other.
cols (List[Union[FloatColumn, StringColumn]], optional) – columns in the current instance to be included in the resulting DataFrame. If set to None, all columns will be used.
other_cols (List[Union[FloatColumn, StringColumn]], optional) – columns in other to be included in the resulting DataFrame. If set to None, all columns will be used.
how (str, optional) –
Type of the join.
Supported options:
'left'
'inner'
'right'
where (_VirtualBooleanColumn, optional) – Boolean column indicating which rows are to be included in the resulting DataFrame. If set to None, all rows will be used. It imposes a SQL-like WHERE condition on the join.
- Raises
TypeError – If any of the input arguments is of wrong type.
- Returns
Handler of the newly created data frame object.
- Return type
DataFrame
Examples
Create DataFrame
data_df = dict(
    colors=["blue", "green", "yellow", "orange"],
    numbers=[2.4, 3.0, 1.2, 1.4],
    join_key=["0", "1", "2", "3"]
)
df = getml.data.DataFrame.from_dict(
    data_df,
    name="df_1",
    roles=dict(join_key=["join_key"],
               numerical=["numbers"],
               categorical=["colors"]))
df

| join_key | colors      | numbers   |
| join key | categorical | numerical |
--------------------------------------
| 0        | blue        | 2.4       |
| 1        | green       | 3         |
| 2        | yellow      | 1.2       |
| 3        | orange      | 1.4       |
Create the other DataFrame
data_other = dict(
    colors=["blue", "green", "yellow", "black", "orange", "white"],
    numbers=[2.4, 3.0, 1.2, 1.4, 3.4, 2.2],
    join_key=["0", "1", "2", "2", "3", "4"])
other = getml.data.DataFrame.from_dict(
    data_other,
    name="df_2",
    roles=dict(join_key=["join_key"],
               numerical=["numbers"],
               categorical=["colors"]))
other

| join_key | colors      | numbers   |
| join key | categorical | numerical |
--------------------------------------
| 0        | blue        | 2.4       |
| 1        | green       | 3         |
| 2        | yellow      | 1.2       |
| 2        | black       | 1.4       |
| 3        | orange      | 3.4       |
| 4        | white       | 2.2       |
Left join the two DataFrames on their join key, keeping the columns 'colors' and 'numbers' from the first one and the column 'colors' as 'other_color' from the second one. As a subcondition, only rows where the 'numbers' columns are equal are selected.
joined_df = df.join(
    name="joined_df",
    other=other,
    how="left",
    join_key="join_key",
    cols=[df["colors"], df["numbers"]],
    other_cols=[other["colors"].alias("other_color")],
    where=(df["numbers"] == other["numbers"]))
joined_df

| colors      | other_color | numbers   |
| categorical | categorical | numerical |
-----------------------------------------
| blue        | blue        | 2.4       |
| green       | green       | 3         |
| yellow      | yellow      | 1.2       |
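The join-with-where semantics shown above can be sketched in plain Python: match rows on the join key, then keep only the pairs that satisfy the condition. This is an illustration of the semantics, not the engine implementation:

```python
# The two example tables as row records.
df_rows = [
    {"join_key": "0", "colors": "blue", "numbers": 2.4},
    {"join_key": "1", "colors": "green", "numbers": 3.0},
    {"join_key": "2", "colors": "yellow", "numbers": 1.2},
    {"join_key": "3", "colors": "orange", "numbers": 1.4},
]
other_rows = [
    {"join_key": "0", "colors": "blue", "numbers": 2.4},
    {"join_key": "1", "colors": "green", "numbers": 3.0},
    {"join_key": "2", "colors": "yellow", "numbers": 1.2},
    {"join_key": "2", "colors": "black", "numbers": 1.4},
    {"join_key": "3", "colors": "orange", "numbers": 3.4},
    {"join_key": "4", "colors": "white", "numbers": 2.2},
]

# Match on join_key, then apply the WHERE-like condition on numbers.
joined = [
    {"colors": l["colors"], "other_color": r["colors"], "numbers": l["numbers"]}
    for l in df_rows
    for r in other_rows
    if l["join_key"] == r["join_key"] and l["numbers"] == r["numbers"]
]
# Three rows survive: blue, green, and yellow.
```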
load()¶
Loads saved data from disk.
The data frame object holding the same name as the current DataFrame instance will be loaded from disk into the getML engine, and the current handler will be updated using refresh().
Examples
Firstly, we have to create and upload some data sets.
d, _ = getml.datasets.make_numerical(population_name='test')
getml.data.list_data_frames()

In the output of list_data_frames() we can find our underlying data frame object 'test' listed under the 'in_memory' key (it was created and uploaded by make_numerical()). This means the getML engine only holds it in memory (RAM) so far, and we still have to save() it to disk in order to load() it again, or to prevent any loss of information between different sessions.

d.save()
getml.data.list_data_frames()
d2 = getml.data.DataFrame(name='test').load()
- Returns
Updated handle to the underlying data frame in the getML engine.
- Return type
DataFrame
Note
When invoking load(), all changes to the underlying data frame object that took place after the last call to the save() method will be lost. This method thus enables you to undo changes applied to the DataFrame.

d, _ = getml.datasets.make_numerical()
d.save()
# Accidental change we want to undo
d.rm('column_01')
d.load()
If save() hasn't been called on the current instance yet, or it wasn't stored to disk in a previous session, load() will throw an exception:
File or directory '../projects/X/data/Y/' not found!
Alternatively, load_data_frame() offers a more user-friendly way of creating DataFrame handlers to data in the getML engine.
n_bytes()¶
Size of the data stored in the underlying data frame in the getML engine.
- Raises
Exception – If the data frame corresponding to the current instance could not be found in the getML engine.
- Returns
Size of the underlying object in bytes.
- Return type
numpy.uint64
n_cols()¶
Number of columns in the current instance.
- Returns
Overall number of columns
- Return type
int
n_rows()¶
Number of rows in the current instance.
- Raises
Exception – If the data frame corresponding to the current instance could not be found in the getML engine.
- Returns
Overall number of rows
- Return type
numpy.int32
num_column(value)¶
Generates a float or integer column that consists solely of a single entry.
- Parameters
value (float) – The value to be used.
- Raises
TypeError – If any of the input arguments is of wrong type.
- Returns
FloatColumn consisting of the singular entry.
- Return type
_VirtualFloatColumn
random(seed=5849)¶
Create a random column.
The numbers will be uniformly distributed from 0.0 to 1.0. This can be used to randomly split a population table into a training and a test set.
- Parameters
seed (int) – Seed used for the random number generator.
- Returns
FloatColumn containing random numbers
- Return type
_VirtualFloatColumn
Example
population = getml.data.DataFrame('population')
population.add(numpy.zeros(100), 'column_01')
print(len(population))
100
idx = population.random(seed=42)
population_train = population.where("population_train", idx > 0.7)
population_test = population.where("population_test", idx <= 0.7)
print(len(population_train), len(population_test))
27 73
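The split works because each row gets an independent uniform draw, so on average about 30% of the rows land above the 0.7 threshold. The thresholding logic can be illustrated in plain Python (this uses Python's random module rather than the engine's generator, so the exact counts differ from those above):

```python
import random

# Draw one uniform number per row and split on a threshold, mirroring
# the idx > 0.7 / idx <= 0.7 logic above.
rng = random.Random(42)
idx = [rng.random() for _ in range(100)]

train = [i for i, x in enumerate(idx) if x > 0.7]
test = [i for i, x in enumerate(idx) if x <= 0.7]
# Every row lands in exactly one of the two splits.
```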
read_csv(fnames, append=False, quotechar='"', sep=',', time_formats=['%Y-%m-%dT%H:%M:%s%z', '%Y-%m-%d %H:%M:%S', '%Y-%m-%d'])¶
Read CSV files.
It is assumed that the first line of each CSV file contains a header with the column names.
- Parameters
fnames (List[str]) – CSV file paths to be read.
append (bool, optional) – If a data frame object holding the same name is already present in the getML engine, should the content of the CSV files in fnames be appended or replace the existing data?
quotechar (str, optional) – The character used to wrap strings.
sep (str, optional) – The separator used for separating fields.
time_formats (List[str], optional) –
The list of formats tried when parsing time stamps.
The formats are allowed to contain the following special characters:
%w - abbreviated weekday (Mon, Tue, …)
%W - full weekday (Monday, Tuesday, …)
%b - abbreviated month (Jan, Feb, …)
%B - full month (January, February, …)
%d - zero-padded day of month (01 .. 31)
%e - day of month (1 .. 31)
%f - space-padded day of month ( 1 .. 31)
%m - zero-padded month (01 .. 12)
%n - month (1 .. 12)
%o - space-padded month ( 1 .. 12)
%y - year without century (70)
%Y - year with century (1970)
%H - hour (00 .. 23)
%h - hour (00 .. 12)
%a - am/pm
%A - AM/PM
%M - minute (00 .. 59)
%S - second (00 .. 59)
%s - seconds and microseconds (equivalent to %S.%F)
%i - millisecond (000 .. 999)
%c - centisecond (0 .. 9)
%F - fractional seconds/microseconds (000000 - 999999)
%z - time zone differential in ISO 8601 format (Z or +NN.NN)
%Z - time zone differential in RFC format (GMT or +NNNN)
%% - percent sign
- Raises
TypeError – If any of the input arguments is of wrong type.
- Returns
Handler of the underlying data.
- Return type
DataFrame
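Since read_csv assumes a header line, files generated by hand should include one. A quick sketch of producing a compliant file with Python's standard csv module (the file name here is illustrative):

```python
import csv
import tempfile

# Write a small CSV with a header row, as read_csv expects.
rows = [["colname1", "colname2"], ["1.0", "A"], ["2.0", "B"]]

with tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False,
                                 newline="") as f:
    csv.writer(f, delimiter=",", quotechar='"').writerows(rows)
    path = f.name

# Reading it back, the first row is the header with the column names.
with open(path, newline="") as f:
    header = next(csv.reader(f))
```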
read_db(table_name, append=False)¶
Fill from database.
The DataFrame will be filled from a table in the database.
- Parameters
table_name (str) – Table from which we want to retrieve the data.
append (bool, optional) – If a data frame object holding the same name is already present in the getML engine, should the content of table_name be appended or replace the existing data?
- Raises
TypeError – If any of the input arguments is of wrong type.
- Returns
Handler of the underlying data.
- Return type
DataFrame
read_json(json_str, append=False, time_formats=['%Y-%m-%dT%H:%M:%s%z', '%Y-%m-%d %H:%M:%S', '%Y-%m-%d'])¶
Fill from JSON.
Fills the data frame with data from a JSON string.
- Parameters
json_str (str) – The JSON string containing the data.
append (bool, optional) – If a data frame object holding the same name is already present in the getML engine, should the content of json_str be appended or replace the existing data?
time_formats (List[str], optional) –
The list of formats tried when parsing time stamps.
The formats are allowed to contain the following special characters:
%w - abbreviated weekday (Mon, Tue, …)
%W - full weekday (Monday, Tuesday, …)
%b - abbreviated month (Jan, Feb, …)
%B - full month (January, February, …)
%d - zero-padded day of month (01 .. 31)
%e - day of month (1 .. 31)
%f - space-padded day of month ( 1 .. 31)
%m - zero-padded month (01 .. 12)
%n - month (1 .. 12)
%o - space-padded month ( 1 .. 12)
%y - year without century (70)
%Y - year with century (1970)
%H - hour (00 .. 23)
%h - hour (00 .. 12)
%a - am/pm
%A - AM/PM
%M - minute (00 .. 59)
%S - second (00 .. 59)
%s - seconds and microseconds (equivalent to %S.%F)
%i - millisecond (000 .. 999)
%c - centisecond (0 .. 9)
%F - fractional seconds/microseconds (000000 - 999999)
%z - time zone differential in ISO 8601 format (Z or +NN.NN)
%Z - time zone differential in RFC format (GMT or +NNNN)
%% - percent sign
- Raises
TypeError – If any of the input arguments is of wrong type.
- Returns
Handler of the underlying data.
- Return type
DataFrame
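A hedged sketch of read_json. The JSON string maps column names to lists of values and can be built with Python's standard json module without an engine; the actual upload (wrapped in a function here) requires a running getML engine and an existing handler `df`. The column names and values are illustrative only.

```python
import json

# Building the payload needs no engine: column names map to value lists.
payload = json.dumps({
    "name": ["patrick", "alex"],
    "when": ["2019-01-01 00:00:00", "2019-01-02 12:30:00"],
})

def fill_from_json(df, json_str=payload):
    # Sketch only: requires a running getML engine. The second time
    # format below matches the "when" column above.
    return df.read_json(
        json_str,
        time_formats=["%Y-%m-%dT%H:%M:%s%z", "%Y-%m-%d %H:%M:%S"])
```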
-
read_pandas
(pandas_df, append=False)¶ Uploads a
pandas.DataFrame
.
Replaces the actual content of the underlying data frame in the getML engine with pandas_df.
- Parameters
pandas_df (
pandas.DataFrame
) – Data the underlying data frame object in the getML engine should obtain.
append (bool, optional) – If a data frame object holding the same name is already present in the getML engine, should the content of pandas_df be appended or replace the existing data?
- Raises
TypeError – If any of the input arguments is of wrong type.
- Returns
Current instance.
- Return type
DataFrame
Note
For columns containing
pandas.Timestamp
small inconsistencies on the order of microseconds can occur when sending the data to the getML engine. This is due to the way the underlying information is stored.
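A minimal sketch of read_pandas. The raw data is built as a plain dict (no engine needed); the upload itself, wrapped in a function, assumes pandas is installed, a getML engine is running, and a handler `df` already exists. Column names are illustrative.

```python
# Illustrative data; the column names are hypothetical.
data = {"column_01": ["a", "b", "c"], "target": [0.0, 1.0, 0.0]}

def upload_pandas(df, data=data):
    # Sketch only: requires pandas and a running getML engine.
    import pandas as pd  # imported lazily so the sketch stays self-contained
    return df.read_pandas(pd.DataFrame(data), append=False)
```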
-
read_query
(query, append=False)¶ Fill from query
Fills the data frame with data from a table in the database.
- Parameters
query (str) – The query used to retrieve the data.
append (bool, optional) – If a data frame object holding the same name is already present in the getML engine, should the content of query be appended or replace the existing data?
- Raises
TypeError – If any of the input arguments is of wrong type.
- Returns
Handler of the underlying data.
- Return type
DataFrame
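A hedged sketch of read_query. The SQL string itself requires nothing; the call, wrapped in a function, assumes a running getML engine with a database connection. Table and column names are hypothetical.

```python
# Hypothetical table/column names; any SQL your database accepts works.
query = "SELECT name, age FROM POPULATION WHERE age > 18"

def fill_from_query(df, query=query, append=False):
    # Sketch only: requires a running getML engine and an active
    # database connection.
    return df.read_query(query, append=append)
```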
-
refresh
()¶ Aligns meta-information of the current instance with the corresponding data frame in the getML engine.
This method can be used to avoid encoding conflicts. Note that
load()
as well as several other methods automatically call
refresh()
.
-
rm
(name)¶ Remove a column.
The column, identified using its name, will be removed both from the current instance and the underlying data frame object in the getML engine.
To keep the current instance and the underlying object in the getML engine in sync, the
refresh()
method will be called internally.
- Parameters
name (str) – Name of the column to be removed. Must match exactly one column in the current instance.
- Returns
Updated version of the current instance.
- Return type
DataFrame
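A minimal sketch of rm. It requires a running getML engine and an existing handler; "comment" is a hypothetical column name.

```python
# Sketch only: requires a running getML engine; "comment" is a
# hypothetical column name.
def drop_column(df, name="comment"):
    # rm() removes the column both from the handler and from the
    # engine-side object, then refreshes the handler to keep them in sync.
    return df.rm(name)
```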
-
rowid
()¶ Get the row numbers of the table.
- Returns
Numerical column containing the row IDs, starting at 0.
- Return type
_VirtualFloatColumn
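A hedged sketch combining rowid() with where(). It assumes a running getML engine and that the virtual column returned by rowid() supports scalar comparisons the way other numerical columns do; the name "head" is a hypothetical choice.

```python
# Sketch only: requires a running getML engine and a handler `df`.
def first_n_rows(df, n=100):
    # rowid() yields a virtual numerical column (0, 1, 2, ...);
    # comparing it against a scalar is assumed to produce a boolean
    # column usable as a where() condition.
    return df.where(name="head", condition=(df.rowid() < n))
```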
-
save
()¶ Writes the underlying data in the getML engine to disk.
To be stored persistently, the corresponding data frame object in the getML engine has to have been created already (via
send()
).
- Returns
The current instance.
- Return type
DataFrame
-
set_role
(names, role, time_formats=['%Y-%m-%dT%H:%M:%s%z', '%Y-%m-%d %H:%M:%S', '%Y-%m-%d'])¶ Assigns a new role to one or more columns.
When switching from a role based on type float to a role based on type string, or vice versa, an implicit type conversion will be performed. The
time_formats
argument is used to interpret time format strings. For more information on roles, please refer to the user guide.
- Parameters
names (str or List[str]) – The name or names of the column.
role (str) – The role to be assigned.
time_formats (str, optional) – Formats to be used to parse the time stamps. This is only necessary, if an implicit conversion from a StringColumn to a time stamp is taking place.
- Raises
TypeError – If any of the input arguments is of wrong type.
ValueError – If one of the provided names does not correspond to an existing column.
Example
data_df = dict(
    animal=["hawk", "parrot", "goose"],
    votes=[12341, 5127, 65311],
    date=["04/06/2019", "01/03/2019", "24/12/2018"])
df = getml.data.DataFrame.from_dict(data_df, "animal_elections")
df.set_role(['animal'], getml.data.roles.categorical)
df.set_role(['votes'], getml.data.roles.numerical)
df.set_role(['date'], getml.data.roles.time_stamp, time_formats=['%d/%m/%Y'])
df
| date                        | animal      | votes     |
| time stamp                  | categorical | numerical |
---------------------------------------------------------
| 2019-06-04T00:00:00.000000Z | hawk        | 12341     |
| 2019-03-01T00:00:00.000000Z | parrot      | 5127      |
| 2018-12-24T00:00:00.000000Z | goose       | 65311     |
-
set_unit
(names, unit, comparison_only=False)¶ Assigns a new unit to one or more columns.
- Parameters
names (str or List[str]) – The name or names of the column.
unit (str) – The unit to be assigned.
comparison_only (bool) –
Whether you want the column to be used for comparison only. This means that the column can only be used in comparison to other columns of the same unit.
An example might be a bank account number: The number in itself is hardly interesting, but it might be useful to know how often we have seen that same bank account number in another table.
If True, this will append “, comparison only” to the unit. The feature engineering algorithms and the feature selectors will interpret this accordingly.
- Raises
TypeError – If any of the input arguments is of wrong type.
ValueError – If one of the provided names does not correspond to an existing column.
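A minimal sketch of set_unit, using the bank-account example from the parameter description above. It requires a running getML engine; the column name "account_id" and the unit string "account" are hypothetical.

```python
# Sketch only: requires a running getML engine; "account_id" and
# "account" are hypothetical names.
def mark_account_numbers(df):
    # An account number carries no numeric meaning of its own, so we
    # restrict it to comparisons with other columns of the same unit.
    df.set_unit(["account_id"], "account", comparison_only=True)
    return df
```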
-
string_column
(value)¶ Generates a string column that consists solely of a single entry.
- Parameters
value (str) – The value to be used.
- Returns
Column consisting of the singular entry.
- Return type
_VirtualStringColumn
- Raises
TypeError – If any of the input arguments is of wrong type.
-
to_csv
(fname, quotechar='"', sep=',')¶ Writes the underlying data into a newly created CSV file.
- Parameters
fname (str) – The name of the CSV file.
quotechar (str, optional) – The character used to wrap strings.
sep (str, optional) – The character used for separating fields.
- Raises
TypeError – If any of the input arguments is of wrong type.
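A hedged sketch of to_csv. It requires a running getML engine; the file name "exported.csv" is a hypothetical choice.

```python
# Sketch only: requires a running getML engine; "exported.csv" is a
# hypothetical file name.
def export_csv(df, fname="exported.csv"):
    # Write the engine-side data to a CSV file, using ';' as field
    # separator and '"' as quote character.
    df.to_csv(fname, quotechar='"', sep=';')
    return fname
```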
-
to_db
(table_name)¶ Writes the underlying data into a newly created table in the database.
- Parameters
table_name (str) –
Name of the table to be created.
If a table of that name already exists, it will be replaced.
- Raises
TypeError – If any of the input arguments is of wrong type.
-
to_json
()¶ Creates a JSON string from the current instance.
Loads the underlying data from the getML engine and constructs a JSON string.
- Returns
JSON string containing the names of the columns of the current instance as keys and their corresponding data as values.
- Return type
str
-
to_pandas
()¶ Creates a
pandas.DataFrame
from the current instance.Loads the underlying data from the getML engine and constructs a
pandas.DataFrame
.- Returns
Pandas equivalent of the current instance including its underlying data.
- Return type
pandas.DataFrame
-
to_placeholder
()¶ Generates a
Placeholder
from the currentDataFrame
.The
refresh()
method will be called internally to assure the resultingPlaceholder
does correspond to the latest version of the data frame object on the getML engine.- Returns
Data model representing the current instance.
- Return type
Placeholder
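A hedged sketch of to_placeholder in context: deriving placeholders from two handlers and joining them to define a data model. It assumes a running getML engine, two existing handlers, and that the Placeholder.join call takes join_key and time_stamp keyword arguments; the join key and time stamp names are hypothetical.

```python
# Sketch only: requires a running getML engine and two existing handlers.
def build_data_model(population_df, peripheral_df):
    population = population_df.to_placeholder()
    peripheral = peripheral_df.to_placeholder()
    # Joining placeholders defines the data model without moving any
    # data; the key/column names below are assumptions.
    population.join(peripheral, join_key="join_key",
                    time_stamp="time_stamp")
    return population
```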
-
where
(name, condition)¶ Extract a subset of rows.
Creates a new
DataFrame
as a subselection of the current instance. Internally it creates a new data frame object in the getML engine containing only a subset of rows of the original one and returns a handler to this new object.- Parameters
name (str) – Name of the new, resulting
DataFrame
.
condition (
_VirtualBooleanColumn
) – Boolean column indicating the rows you want to select.
- Raises
TypeError – If any of the input arguments is of wrong type.
- Returns
Handler of the newly created data frame containing just a subset of the rows of the current instance.
- Return type
Example
Generate example data:
data = dict(
    fruit=["banana", "apple", "cherry", "cherry", "melon", "pineapple"],
    price=[2.4, 3.0, 1.2, 1.4, 3.4, 3.4],
    join_key=["0", "1", "2", "2", "3", "3"])
fruits = getml.data.DataFrame.from_dict(
    data, name="fruits",
    roles={"categorical": ["fruit"],
           "join_key": ["join_key"],
           "numerical": ["price"]})
fruits
| join_key | fruit       | price     |
| join key | categorical | numerical |
--------------------------------------
| 0        | banana      | 2.4       |
| 1        | apple       | 3         |
| 2        | cherry      | 1.2       |
| 2        | cherry      | 1.4       |
| 3        | melon       | 3.4       |
| 3        | pineapple   | 3.4       |
Apply where condition. This creates a new DataFrame called “cherries”:
cherries = fruits.where(
    name="cherries",
    condition=(fruits["fruit"] == "cherry"))
cherries
| join_key | fruit       | price     |
| join key | categorical | numerical |
--------------------------------------
| 2        | cherry      | 1.2       |
| 2        | cherry      | 1.4       |