getml.data

Contains functionalities for importing, handling, and retrieving data from the getML engine.

All data relevant to the getML suite must be present in the getML engine. The Python API itself does not store any of the data used for training or prediction. Instead, it provides a handler class for the data frame objects in the getML engine: the DataFrame. Using either this overall handler or the individual columns it is composed of, you can both import data into the engine and retrieve it, as well as perform operations on it. In addition to the data frame objects, the engine also uses an abstract and lightweight version of the underlying data model, which is represented by the Placeholder.

In general, working with data within the getML suite is organized in three steps: importing the data, annotating it with roles, and constructing the data model.

Examples:

Creating a new data frame object in the getML engine and importing data is done by one of the class methods from_csv(), from_db(), from_json(), or from_pandas().

In this example, we load data directly from a public database on the internet. First, however, we have to connect the getML engine to the database (see the MySQL interface section in the user guide for further details).

getml.database.connect_mysql(
    host="relational.fit.cvut.cz",
    port=3306,
    dbname="financial",
    user="guest",
    password="relational",
    time_formats=['%Y/%m/%d']
)
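The patterns in time_formats appear to follow the same placeholder syntax as Python's strptime. As a plain-Python illustration (not the getML API) of what a pattern like '%Y/%m/%d' matches:

```python
from datetime import datetime

# '%Y/%m/%d' matches a four-digit year, then month and day,
# separated by slashes -- the layout of the dates in the 'loan' table.
parsed = datetime.strptime("1994/01/05", "%Y/%m/%d")
print(parsed.year, parsed.month, parsed.day)  # 1994 1 5
```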

Using the established connection, we can tell the engine to construct a new data frame object called 'df_loan', fill it with the data of the 'loan' table contained in the MySQL database, and return a DataFrame handler associated with it.

loan = getml.DataFrame.from_db('loan', 'df_loan')

print(loan)
...
| loan_id      | account_id   | amount       | duration     | date          | payments      | status        |
| unused float | unused float | unused float | unused float | unused string | unused string | unused string |
-------------------------------------------------------------------------------------------------------------
| 4959         | 2            | 80952        | 24           | 1994-01-05    | 3373.00       | A             |
| 4961         | 19           | 30276        | 12           | 1996-04-29    | 2523.00       | B             |
| 4962         | 25           | 30276        | 12           | 1997-12-08    | 2523.00       | A             |
| 4967         | 37           | 318480       | 60           | 1998-10-14    | 5308.00       | D             |
| 4968         | 38           | 110736       | 48           | 1998-04-19    | 2307.00       | C             |
...

In order to construct the data model and to allow the feature learning algorithm to get the most out of your data, you have to assign roles to columns using the set_role() method (see Annotating data for details).

loan.set_role(["duration", "amount", "payments"], getml.data.roles.numerical)
loan.set_role(["loan_id", "account_id"], getml.data.roles.join_key)
loan.set_role("date", getml.data.roles.time_stamp)
loan.set_role("default", getml.data.roles.target)

print(loan)
| date                        | loan_id  | account_id | default | payments  | duration  | amount    | status        |
| time stamp                  | join key | join key   | target  | numerical | numerical | numerical | unused string |
---------------------------------------------------------------------------------------------------------------------
| 1994-01-05T00:00:00.000000Z | 4959     | 2          | 0       | 3373      | 24        | 80952     | A             |
| 1996-04-29T00:00:00.000000Z | 4961     | 19         | 1       | 2523      | 12        | 30276     | B             |
| 1997-12-08T00:00:00.000000Z | 4962     | 25         | 0       | 2523      | 12        | 30276     | A             |
| 1998-10-14T00:00:00.000000Z | 4967     | 37         | 1       | 5308      | 60        | 318480    | D             |
| 1998-04-19T00:00:00.000000Z | 4968     | 38         | 0       | 2307      | 48        | 110736    | C             |
...

Finally, we construct the data model by deriving a Placeholder from each DataFrame and establishing relations between them using the join() method.

# But first, we need a second data set to build a data model.
trans = getml.DataFrame.from_db(
    'trans', 'df_trans',
    roles={
        getml.data.roles.numerical: ["amount", "balance"],
        getml.data.roles.categorical: ["type", "bank", "k_symbol",
                                       "account", "operation"],
        getml.data.roles.join_key: ["account_id"],
        getml.data.roles.time_stamp: ["date"],
    }
)

ph_loan = loan.to_placeholder()
ph_trans = trans.to_placeholder()

ph_loan.join(ph_trans, join_key="account_id",
             time_stamp="date")

The data model contained in ph_loan can now be used to construct a Pipeline.
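The join above links each loan to the transactions of the same account via the join key, while the time stamp ensures that only transactions dated on or before the loan's date can enter the features, so no information from the future leaks into training. This is not the getML API, just a minimal pure-Python sketch of that matching rule, using hypothetical miniature rows:

```python
from datetime import date

# Hypothetical miniature tables mirroring 'loan' (population)
# and 'trans' (peripheral); not real data from the database.
loan = {"account_id": 2, "date": date(1994, 1, 5)}
trans = [
    {"account_id": 2, "date": date(1993, 12, 1), "amount": 700.0},
    {"account_id": 2, "date": date(1994, 3, 1), "amount": 900.0},   # after the loan
    {"account_id": 19, "date": date(1993, 12, 1), "amount": 100.0}, # other account
]

# Match on the join key AND require the peripheral time stamp to lie
# on or before the population time stamp -- rows from the future are excluded.
matches = [
    t for t in trans
    if t["account_id"] == loan["account_id"] and t["date"] <= loan["date"]
]
print([t["amount"] for t in matches])  # [700.0]
```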

Classes

Container([population, peripheral, split, ...])

A container holds the actual data in the form of a DataFrame or a View.

DataFrame(name[, roles])

Handler for the data stored in the getML engine.

DataModel(population)

Abstract representation of the relationship between tables.

Placeholder(name[, roles])

Abstract representation of tables and their relations.

Roles(categorical, join_key, numerical, ...)

Roles can be passed to DataFrame to predefine the roles assigned to certain columns.

StarSchema([population, alias, peripheral, ...])

A StarSchema is a simplifying abstraction for machine learning problems whose data can be organized in a simple star schema.

Subset(container_id, peripheral, population)

A Subset consists of a population table and one or several peripheral tables.

TimeSeries(population, time_stamps[, alias, ...])

A TimeSeries is a simplifying abstraction that can be used for machine learning problems on time series data.

View(base[, name, subselection, added, dropped])

A view is a lazily evaluated, immutable representation of a DataFrame.

Functions

concat(name, data_frames)

Creates a new data frame by concatenating a list of existing ones.

delete(name)

If a data frame named 'name' exists, it is deleted.

exists(name)

Returns true if a data frame named 'name' exists.

list_data_frames()

Lists all available data frames of the project.

load_container(container_id)

Loads a container and all associated data frames from disk.

load_data_frame(name)

Retrieves a DataFrame handler of data in the getML engine.

to_placeholder(*args, **kwargs)

Factory function for extracting placeholders from a DataFrame or View.

Submodules

access

Manages the access to various data sources.

columns

Handlers for 1-d arrays storing the data of an individual variable.

relationship

Marks the relationship between joins in a Placeholder.

roles

A role determines if and how columns are handled during the construction of the DataModel and used by the feature learning algorithm (see feature_learning).

split

Helps you split data into training, testing, validation, or other sets.

time

Convenience functions for the handling of time stamps.

subroles

Subroles allow for more fine-granular control of how certain columns will be used by the pipeline.
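As a rough conceptual illustration of what the split submodule provides: each row is tagged with a subset label such as 'train' or 'test', which can later be used to select the corresponding subset. A pure-Python sketch (not the getML API; the function name and shares are made up for illustration):

```python
import random

def tag_rows(n_rows, train_share=0.8, seed=42):
    """Assign each of n_rows a 'train' or 'test' label at random,
    reproducibly via a seeded generator."""
    rng = random.Random(seed)
    return ["train" if rng.random() < train_share else "test"
            for _ in range(n_rows)]

labels = tag_rows(10)
print(len(labels))  # 10
```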