Deployment¶
The results of the feature learning and the prediction can be retrieved in different ways and formats.
Returning Python objects
Using the transform() and predict() methods of a trained Pipeline, you can access both the features and the predictions as numpy.ndarray via the Python API (see the sketch below).
Writing into a database
You can also write both features and prediction results back into a new table of the connected database by providing the
table_name
argument in thetransform()
andpredict()
methods. Please refer to the unified import interface for information on how to connect a database.
Responding to an HTTP(S) POST request
The getML suite also provides per-pipeline HTTP(S) endpoints to which you can post new data as a JSON string and retrieve either the resulting features or the predictions. The remainder of this chapter covers how this is done in detail.
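For the first two options, here is a minimal sketch, assuming the trained pipeline pipe and the data frames constructed in the Prerequisites section below; the values passed to table_name are just illustrative names:
# Retrieve features and predictions as numpy.ndarray via the Python API.
features = pipe.transform(population_table, peripheral_table)
predictions = pipe.predict(population_table, peripheral_table)
# Alternatively, write the results into new tables of the connected database.
# The table names below are illustrative.
pipe.transform(population_table, peripheral_table, table_name="features")
pipe.predict(population_table, peripheral_table, table_name="predictions")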
Prerequisites¶
Let’s begin with an example pipeline we can use to illustrate how the endpoints work.
from getml.feature_learning import aggregations
from getml import data
from getml import datasets
from getml import engine
from getml import feature_learning
from getml.feature_learning import loss_functions
from getml import pipeline
from getml import predictors
engine.set_project("examples")
# ----------------
# Generate artificial dataset
population_table, peripheral_table = datasets.make_numerical()
# ----------------
# Construct placeholders
population_placeholder = data.Placeholder("POPULATION")
peripheral_placeholder = data.Placeholder("PERIPHERAL")
population_placeholder.join(
peripheral_placeholder,
"join_key",
"time_stamp",
memory=10.0)
predictor = predictors.LinearRegression()
# ----------------
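# Feature learner 1: a Multirel model using COUNT and SUM aggregations.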
fe1 = feature_learning.MultirelModel(
aggregation=[
aggregations.Count,
aggregations.Sum
],
loss_function=loss_functions.SquareLoss,
num_features=10,
share_aggregations=1.0,
max_length=1,
num_threads=0
)
# ----------------
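# Feature learner 2: a Relboost model.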
fe2 = feature_learning.RelboostModel(
loss_function=loss_functions.SquareLoss,
num_features=10
)
# ----------------
pipe = pipeline.Pipeline(
tags=["multiple targets", "multirel", "relboost"],
population=population_placeholder,
peripheral=[peripheral_placeholder],
feature_learners=[fe1, fe2],
predictors=[predictor]
)
# ----------------
pipe = pipe.fit(
population_table,
peripheral_table
)
# ----------------
# Before you can send HTTP requests to the pipeline,
# you need to deploy it.
pipe.deploy(True)
Both the data model and the data itself are quite similar to what was used in the getting started guide. Their overall structure looks like this:
print(population_data_frame)
| time_stamp | join_key | targets | column_01 |
| time stamp | join key | target | numerical |
-----------------------------------------------------------------
| 1970-01-01T05:33:24.789405Z | 0 | 54 | 0.4544 |
| 1970-01-01T20:37:18.369208Z | 1 | 151 | -0.422029 |
| ... | ... | ... | ... |
and
print(peripheral_data_frame)
| time_stamp | join_key | column_01 |
| time stamp | join key | numerical |
------------------------------------------------------
| 1970-01-01T02:16:34.823342Z | 397 | 0.938692 |
| 1970-01-01T04:35:12.660807Z | 146 | -0.378773 |
| ... | ... | ... |
HTTP(S) Endpoints¶
As soon as you have trained a pipeline, whitelisted it for external access using its deploy() method, and configured the getML monitor for remote access, you can transform new data into features or make predictions on it using these endpoints:
transform endpoint: https://route-to-your-monitor:1710/transform/PIPELINE_NAME
predict endpoint: https://route-to-your-monitor:1710/predict/PIPELINE_NAME
To each of them you have to send a POST request containing the new data as a JSON string in a specific request format.
Note
For testing and development purposes, you can also use the HTTP port of the monitor to query the endpoints. Note that this is only possible from the same host. The corresponding syntax is http://localhost:1709/predict/PIPELINE_NAME
Request Format¶
In all POST requests to the endpoints, a JSON string with the following syntax has to be provided in the body:
{
"peripheral": [{
"column_1": [],
"column_2": []
},{
"column_1": [],
"column_2": []
}],
"population": {
"column_1": [],
"column_2": []
}
}
The JSON string has to contain exactly two top-level keys, population and peripheral, which hold the new input data. The order of the columns is irrelevant; they are matched by name. The order of the individual peripheral tables, however, is very important: it has to exactly match the order in which the corresponding Placeholders were passed to the constructor of the Pipeline.
In our example above, we could post a JSON string like this:
{
"peripheral": [{
"column_01": [2.4, 3.0, 1.2, 1.4, 2.2],
"join_key": ["0", "0", "0", "0", "0"],
"time_stamp": [0.1, 0.2, 0.3, 0.4, 0.8]
}],
"population": {
"column_01": [2.2, 3.2],
"join_key": ["0", "0"],
"time_stamp": [0.65, 0.81]
}
}
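Instead of curl (used in the endpoint sections below), you can also send such a request from Python. Here is a minimal sketch using only the standard library, assuming the monitor's HTTP port on localhost (see the note above) and substituting the name of your deployed pipeline for PIPELINE_NAME:
import json
import urllib.request

# The pipeline name is a placeholder; use the name of your deployed pipeline.
url = "http://localhost:1709/transform/PIPELINE_NAME"

body = {
    "peripheral": [{
        "column_01": [2.4, 3.0, 1.2, 1.4, 2.2],
        "join_key": ["0", "0", "0", "0", "0"],
        "time_stamp": [0.1, 0.2, 0.3, 0.4, 0.8]
    }],
    "population": {
        "column_01": [2.2, 3.2],
        "join_key": ["0", "0"],
        "time_stamp": [0.65, 0.81]
    }
}

request = urllib.request.Request(
    url,
    data=json.dumps(body).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST"
)

# Print the raw JSON response returned by the endpoint.
with urllib.request.urlopen(request) as response:
    print(response.read().decode("utf-8"))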
Time stamp formats in requests¶
You might have noticed that the time stamps in the example above were passed as numerical values and not as the string representations shown at the beginning. Both ways are supported by the getML monitor. But if you choose to pass string representations, you also have to specify their format so that the getML engine can interpret your data properly.
{
"peripheral": [{
"column_01": [2.4, 3.0, 1.2, 1.4, 2.2],
"join_key": ["0", "0", "0", "0", "0"],
"time_stamp": ["2010-01-01 00:15:00", "2010-01-01 08:00:00", "2010-01-01 09:30:00", "2010-01-01 13:00:00", "2010-01-01 23:35:00"]
}],
"population": {
"column_01": [2.2, 3.2],
"join_key": ["0", "0"],
"time_stamp": ["2010-01-01 12:30:00", "2010-01-01 23:30:00"]
},
"timeFormats": ["%Y-%m-%d %H:%M:%S"]
}
All special characters available for specifying the format of the time stamps are listed and described in, e.g., getml.data.DataFrame.read_csv().
Using an existing DataFrame¶
You can also use a DataFrame that already exists on the getML engine:
{
"peripheral": [{
"df": "peripheral_table"
}],
"population": {
"column_01": [2.2, 3.2],
"join_key": ["0", "0"],
"time_stamp": [0.65, 0.81]
}
}
Using data from a database¶
You can also read the data from the connected database (see unified import interface) by passing an arbitrary query to the query key:
{
"peripheral": [{
"query": "SELECT * FROM PERIPHERAL WHERE join_key = '0';"
}],
"population": {
"column_01": [2.2, 3.2],
"join_key": ["0", "0"],
"time_stamp": [0.65, 0.81]
}
}
Transform Endpoint¶
The transform endpoint returns the generated features.
https://route-to-your-monitor:1710/transform/PIPELINE_NAME
Such an HTTP(S) request can be sent from many languages. For illustration purposes, we will use the command line tool curl, which comes preinstalled on both Linux and macOS. Also, we will use the HTTP port via localhost (only possible for terminals running on the same machine as the getML monitor) for better reproducibility.
curl --header "Content-Type: application/json" \
--request POST \
--data '{"peripheral":[{"column_01":[2.4,3.0,1.2,1.4,2.2],"join_key":["0","0","0","0","0"],"time_stamp":[0.1,0.2,0.3,0.4,0.8]}],"population":{"column_01":[2.2,3.2],"join_key":["0","0"],"time_stamp":[0.65,0.81]}}' \
http://localhost:1709/transform/PIPELINE_NAME
Predict Endpoint¶
When using getML as an end-to-end data science pipeline, you can use the predict endpoint to upload new, unseen data and receive the resulting predictions as a response via HTTP(S).
https://route-to-your-monitor:1710/predict/PIPELINE_NAME
As with the transform endpoint, such a request can be sent from many languages. For illustration purposes, we again use curl and the HTTP port via localhost.
curl --header "Content-Type: application/json" \
--request POST \
--data '{"peripheral":[{"column_01":[2.4,3.0,1.2,1.4,2.2],"join_key":["0","0","0","0","0"],"time_stamp":[0.1,0.2,0.3,0.4,0.8]}],"population":{"column_01":[2.2,3.2],"join_key":["0","0"],"time_stamp":[0.65,0.81]}}' \
http://localhost:1709/predict/PIPELINE_NAME