Importing data

Before you can analyze and process your data with the getML software, you have to import it into the engine. At the end of this step, your data will be stored in data frame objects in the getML engine, ready to be annotated (refer to annotating data).

Note

If you have imported your data into the engine before and want to restore it, refer to Lifecycle of a DataFrame.

Unified import interface

The getML Python API provides a unified import interface that requires similar arguments and produces the same output format, regardless of the data source.

You can use one of the dedicated from_csv(), from_pandas(), from_db(), and from_json() class methods to construct a data frame object in the getML engine, fill it with the provided data, and retrieve a DataFrame handle in the Python API.

If you already have a data frame object in place, you can use the read_csv(), read_pandas(), read_db(), or read_json() methods of the corresponding DataFrame handle to either replace its content with new data or append to it.

All of these methods have counterparts for exporting: to_csv(), to_pandas(), to_db(), and to_json().
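The three families of methods can be combined into a short round trip. The following is a minimal sketch rather than a definitive recipe. It assumes that the getML engine is already running and that a project has been set. It uses the JSON variants because they need nothing beyond the standard library, and it assumes that from_json() accepts a column-oriented JSON string. The helper names build_json and roundtrip, the data frame name, and the column names are hypothetical. Argument names may differ between getML versions, so check the API reference for your release.

```python
import json


def build_json(rows):
    """Convert a list of row dicts into a column-oriented JSON string.

    All rows are expected to share the same keys.
    """
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return json.dumps(columns)


def roundtrip(name="population_sketch"):
    """Create, extend, and export a data frame (requires a running engine)."""
    import getml  # imported lazily so build_json() works without getML

    first = build_json([{"join_key": "a", "value": 1.0}])
    second = build_json([{"join_key": "b", "value": 2.0}])

    # from_json(): create the data frame in the engine and get a handle.
    df = getml.DataFrame.from_json(first, name=name)

    # read_json(): append=True adds rows; append=False would replace them.
    df.read_json(second, append=True)

    # to_json(): export the content back to a JSON string.
    return df.to_json()
```

The same pattern should carry over to the CSV, pandas, and database variants by swapping the method suffix and the type of the first argument.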

The particularities of the individual formats will be covered in the following sections:

Data Frames

The resulting DataFrame instance in the Python API represents a handle to the corresponding data frame object in the getML engine. The mapping between the two is done based on the name of the object, which has to be unique. Similarly, the names of the columns are required to be unique within the data frame they are associated with.

Handling of NULL values

Data sources often contain missing or corrupt entries, also called NULL values. getML can work with missing values everywhere except in the target variable, which must not contain any NULL values because a NULL target is not meaningful. Please refer to the section on join keys for details about how NULL values in join keys are handled during the construction of the data model.

During import, a NULL value is automatically inserted at all occurrences of the strings “nan”, “None”, “NA”, or an empty string, as well as at all occurrences of None and NaN.
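Because the target must not contain NULL values, it can be useful to check a column against this rule before importing it. The sketch below is an illustration in plain Python, not getML's own implementation. The names NULL_STRINGS, is_null, and count_nulls are hypothetical, and exact, case-sensitive matching of the listed strings is an assumption.

```python
import math

# Strings that the rule above lists as NULL markers during import.
NULL_STRINGS = {"nan", "None", "NA", ""}


def is_null(value):
    """Return True if `value` would count as NULL under the rule above."""
    if value is None:
        return True
    if isinstance(value, float) and math.isnan(value):
        return True
    # Assumption: strings are matched exactly and case-sensitively.
    return isinstance(value, str) and value in NULL_STRINGS


def count_nulls(column):
    """Count the entries of `column` that would be imported as NULL."""
    return sum(is_null(value) for value in column)


target = [1.0, 0.0, "NA", float("nan")]
print(count_nulls(target))  # prints 2
```

A non-zero count for the intended target column indicates that the data needs cleaning before it is imported and annotated.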