Before you can analyze and process your data with the getML software, you have to import it into the engine. At the end of this step, your data will be stored in data frame objects in the getML engine, ready to be annotated (refer to annotating data).
If you have imported your data into the engine before and want to restore it, refer to Lifecycle of a DataFrame.
Unified import interface
The getML Python API provides a unified import interface that requires similar arguments and produces the same output format, regardless of the data source.
You can use one of the dedicated class methods, such as from_json(), to construct a data frame object in the getML engine, fill it with the provided data, and retrieve a DataFrame handle in the Python API.
If you already have a data frame object in place, you can use one of the methods of the corresponding DataFrame handle, such as read_json(), to either replace its content with new data or append to it.
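The two import paths described above can be sketched as follows. This is a minimal illustration, not a definitive recipe: it assumes the getml package is installed, the getML engine is running, and a project has already been set. The sample data, the data frame name "people", and the helper import_example() are hypothetical and chosen only for this example.

```python
import json

# Hypothetical sample data: one key per column, one list of values per column.
json_str = json.dumps({
    "name": ["alice", "bob", "carol"],
    "age": [34, 27, 45],
})

# Hypothetical additional rows with the same columns.
more_json_str = json.dumps({"name": ["dave"], "age": [51]})


def import_example():
    """Create a data frame from JSON, then append more rows to it.

    Assumes the getML engine is running and a project has been set.
    """
    # Imported lazily so the sample data can be inspected without getML.
    import getml

    # Construct the data frame object in the engine, fill it with the data,
    # and retrieve a handle. The name "people" is an assumption and has to
    # be unique.
    df = getml.DataFrame.from_json(json_str, name="people")

    # Append to the existing data frame. With append=False the content
    # would be replaced instead.
    df.read_json(more_json_str, append=True)
    return df
```

Calling import_example() returns the DataFrame handle. Without further arguments no roles are assigned here, so the columns would still have to be annotated afterwards.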
The particularities of the individual formats are covered in the following sections.
A DataFrame instance in the Python API represents a handle to the corresponding data frame object in the getML engine. The mapping between the two is based on the name of the object, which has to be unique. Similarly, column names are required to be unique within the data frame they are associated with.
Handling of NULL values
Unfortunately, data sources often contain missing or corrupt data, also called NULL values. getML is able to work with missing values everywhere except in the target variable, which must not contain any NULL values (a NULL target does not make any sense). Please refer to the section on join keys for details about how NULL values in join keys are handled during the construction of the data model.
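As a rough illustration of the string-based rule described in this section, the following pure-Python sketch maps the listed marker strings to None. It is not getML's implementation: the names NULL_STRINGS and to_null are hypothetical, only the four strings named in this section are covered, and exact, case-sensitive matching is an assumption.

```python
from typing import Optional

# Strings named in this section as being imported as NULL.
NULL_STRINGS = frozenset({"nan", "None", "NA", ""})


def to_null(value: str) -> Optional[str]:
    """Return None for a NULL marker string, otherwise the value unchanged."""
    # Assumption: exact, case-sensitive comparison against the marker strings.
    return None if value in NULL_STRINGS else value


# Hypothetical raw column as it might appear in a data source.
raw_column = ["berlin", "NA", "", "paris", "None", "nan"]
cleaned_column = [to_null(v) for v in raw_column]
```

A sketch like this can be handy for checking a data source before import, for example to count how many entries of a prospective target column would end up as NULL.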
During import a NULL value is automatically inserted at all
occurrences of the strings “nan”, “None”, “NA”, or an empty string as
well as at all occurrences of