getml.data¶
Provides functionality for importing, handling, and retrieving data from the getML engine.
All data relevant for the getML suite has to be present in the getML
engine. The Python API itself does not store any of the data used for
training or prediction. Instead, it provides a handler class for the
data frame objects in the getML engine, the DataFrame. Using either this
overall handler for the underlying data set or the individual columns
it is composed of, one can import data into the engine, retrieve data
from it, and perform operations on it. In addition to the data
frame objects, the engine also uses an abstract and lightweight
version of the underlying data model, which is represented by the
Placeholder.
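The split between a client-side handler and engine-side storage can be pictured with a small toy model. This is only a sketch in plain Python: `ToyEngine` and `ToyHandle` are hypothetical names invented for illustration and are not part of the getML API. The point it shows is that the handle keeps a name and a reference to the engine, while every data access is forwarded to the engine.

```python
# Toy model only: none of these names are part of the getML API.
class ToyEngine:
    """Stands in for the engine process that owns all data."""

    def __init__(self):
        self._frames = {}  # frame name -> {column name -> list of values}

    def store(self, name, columns):
        self._frames[name] = {col: list(vals) for col, vals in columns.items()}

    def fetch_column(self, name, column):
        return list(self._frames[name][column])

    def n_rows(self, name):
        cols = self._frames[name]
        return len(next(iter(cols.values()))) if cols else 0


class ToyHandle:
    """Client-side handle: holds only a name and a reference to the engine."""

    def __init__(self, engine, name):
        self.engine = engine
        self.name = name

    def __getitem__(self, column):
        # Every access is forwarded to the engine; nothing is cached locally.
        return self.engine.fetch_column(self.name, column)

    def __len__(self):
        return self.engine.n_rows(self.name)


engine = ToyEngine()
engine.store("df_loan", {"loan_id": [4959, 4961], "amount": [80952, 30276]})

loan = ToyHandle(engine, "df_loan")
amounts = loan["amount"]
n = len(loan)
```

Because the handle carries no rows of its own, creating several handles for the same name is cheap, and they all see the same engine-side data.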
In general, working with data within the getML suite is organized in three steps:
1. Importing the data into the getML engine.
2. Annotating the data by assigning roles to the individual columns.
3. Constructing the data model by deriving Placeholder objects from the data and joining them to represent the data schema.
Examples
Creating a new data frame object in the getML engine and importing
data is done by one of the class methods from_csv(), from_db(),
from_json(), or from_pandas().
In this example, we load data directly from a public database on the internet. First, we have to connect the getML engine to the database (see MySQL interface in the user guide for further details).
import getml

getml.database.connect_mysql(
    host="relational.fit.cvut.cz",
    port=3306,
    dbname="financial",
    user="guest",
    password="relational",
    time_formats=['%Y/%m/%d']
)
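The `time_formats` entry looks like a strptime-style pattern. As a rough illustration, and assuming the directives `%Y`, `%m`, and `%d` carry their usual meaning (four-digit year, month, day), the snippet below uses Python's standard library rather than getML to show which strings such a pattern accepts. The engine's own parser may differ in details; `matches` is a hypothetical helper.

```python
from datetime import datetime

TIME_FORMAT = "%Y/%m/%d"  # same pattern as passed to connect_mysql()


def matches(value, fmt=TIME_FORMAT):
    """Return True if `value` can be parsed with `fmt` by datetime.strptime."""
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False


parsed = datetime.strptime("1994/01/05", TIME_FORMAT)
slash_ok = matches("1994/01/05")    # year/month/day separated by slashes
dotted_ok = matches("05.01.1994")   # different layout, does not fit the pattern
```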
Using the established connection, we can tell the engine to
construct a new data frame object called ‘df_loan’, fill it with
the data of the ‘loan’ table contained in the MySQL database, and
return a DataFrame handler associated with it.
loan = getml.data.DataFrame.from_db('loan', 'df_loan')
print(loan)
...
| loan_id | account_id | amount | duration | date | payments | status |
| unused float | unused float | unused float | unused float | unused string | unused string | unused string |
-------------------------------------------------------------------------------------------------------------
| 4959 | 2 | 80952 | 24 | 1994-01-05 | 3373.00 | A |
| 4961 | 19 | 30276 | 12 | 1996-04-29 | 2523.00 | B |
| 4962 | 25 | 30276 | 12 | 1997-12-08 | 2523.00 | A |
| 4967 | 37 | 318480 | 60 | 1998-10-14 | 5308.00 | D |
| 4968 | 38 | 110736 | 48 | 1998-04-19 | 2307.00 | C |
...
In order to construct the data model and for the feature
learning algorithm to get the most out of your data, you have
to assign roles to columns using the set_role() method (see
Annotating data for details).
loan.set_role(["duration", "amount"], getml.data.roles.numerical)
loan.set_role(["loan_id", "account_id"], getml.data.roles.join_key)
loan.set_role("date", getml.data.roles.time_stamp)
loan.set_role(["payments"], getml.data.roles.target)
print(loan)
| date | loan_id | account_id | default | payments | duration | amount | status |
| time stamp | join key | join key | target | numerical | numerical | numerical | unused string |
---------------------------------------------------------------------------------------------------------------------
| 1994-01-05T00:00:00.000000Z | 4959 | 2 | 0 | 3373 | 24 | 80952 | A |
| 1996-04-29T00:00:00.000000Z | 4961 | 19 | 1 | 2523 | 12 | 30276 | B |
| 1997-12-08T00:00:00.000000Z | 4962 | 25 | 0 | 2523 | 12 | 30276 | A |
| 1998-10-14T00:00:00.000000Z | 4967 | 37 | 1 | 5308 | 60 | 318480 | D |
| 1998-04-19T00:00:00.000000Z | 4968 | 38 | 0 | 2307 | 48 | 110736 | C |
...
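Conceptually, role annotation is a mapping from column names to roles in which every column starts out as unused. The following plain-Python sketch mirrors the first three set_role() calls on a dictionary. `assign_role` is a hypothetical helper written for this illustration, not getML's set_role(), and the role labels are simply the strings shown in the printed table headers.

```python
# Plain-Python illustration; `assign_role` is a hypothetical helper,
# not getML's set_role().
UNUSED_FLOAT = "unused float"
UNUSED_STRING = "unused string"

# Initial state as printed right after the import: every column is unused.
roles = {
    "loan_id": UNUSED_FLOAT,
    "account_id": UNUSED_FLOAT,
    "amount": UNUSED_FLOAT,
    "duration": UNUSED_FLOAT,
    "date": UNUSED_STRING,
    "payments": UNUSED_STRING,
    "status": UNUSED_STRING,
}


def assign_role(roles, names, role):
    """Assign `role` to one column name or a list of names; reject unknown columns."""
    if isinstance(names, str):
        names = [names]
    for name in names:
        if name not in roles:
            raise KeyError(f"unknown column: {name}")
        roles[name] = role


assign_role(roles, ["duration", "amount"], "numerical")
assign_role(roles, ["loan_id", "account_id"], "join key")
assign_role(roles, "date", "time stamp")

# Columns that have not been annotated keep their unused role.
still_unused = sorted(c for c, r in roles.items() if r.startswith("unused"))
```

Accepting either a single name or a list matches how the calls in the example pass both `"date"` and `["duration", "amount"]`.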
Finally, we are able to construct the data model by deriving a
Placeholder from each DataFrame and establishing relations between
them using the join() method.
# But first, we need a second data set to build a data model.
trans = getml.data.DataFrame.from_db(
    'trans', 'df_trans',
    roles={
        getml.data.roles.numerical: ["amount", "balance"],
        getml.data.roles.categorical: ["type", "bank", "k_symbol",
                                       "account", "operation"],
        getml.data.roles.join_key: ["account_id"],
        getml.data.roles.time_stamp: ["date"],
    }
)
ph_loan = loan.to_placeholder()
ph_trans = trans.to_placeholder()
ph_loan.join(ph_trans, join_key="account_id",
             time_stamp="date")
The data model contained in ph_loan can now be used to
construct a Pipeline. These data models are agnostic to the actual
data, which is not required until training and prediction.
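To make the idea of a data-agnostic model concrete, here is a minimal sketch in plain Python. `ToyPlaceholder` is a hypothetical stand-in, not getML's Placeholder: it records only table names and join conditions. The `resolve` function reflects one common reading of a join with a time stamp, namely that only rows of the joined table dated no later than the row being joined to are matched. Treat that as an assumption rather than a statement of the engine's exact behaviour. The transaction rows are invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyPlaceholder:
    """Hypothetical stand-in: stores schema information only, never rows."""
    name: str
    joins: list = field(default_factory=list)

    def join(self, other, join_key, time_stamp=None):
        # Record the relation; no data is touched here.
        self.joins.append((other.name, join_key, time_stamp))


def resolve(population_row, peripheral_rows, join_key, time_stamp):
    """Assumed semantics: same key, and peripheral time stamp <= population time stamp."""
    return [
        row for row in peripheral_rows
        if row[join_key] == population_row[join_key]
        and row[time_stamp] <= population_row[time_stamp]
    ]


ph_loan = ToyPlaceholder("df_loan")
ph_trans = ToyPlaceholder("df_trans")
ph_loan.join(ph_trans, join_key="account_id", time_stamp="date")

# Data enters only now. The loan row reuses values from the loan table;
# the transaction rows are made up. ISO date strings compare correctly as text.
loan_row = {"account_id": 2, "date": "1994-01-05"}
trans_rows = [
    {"account_id": 2, "date": "1993-02-26", "amount": 1100.0},
    {"account_id": 2, "date": "1994-06-30", "amount": 250.0},
    {"account_id": 19, "date": "1993-12-01", "amount": 500.0},
]
_, key, ts = ph_loan.joins[0]
matched = resolve(loan_row, trans_rows, key, ts)
```

In this sketch the schema is fully defined before any rows exist, which is the sense in which the model is agnostic to the data.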
Functions¶
|
Creates a new data frame by concatenating a list of existing ones. |
|
If a data frame named ‘name’ exists, it is deleted. |
|
Returns true if a data frame named ‘name’ exists. |
|
Retrieves a |
Lists all available data frames of the project. |
Classes¶
| DataFrame | Handler for the data stored in the getML engine. |
| Placeholder | Abstract representation of tables and their relations. |